Basic usage of the urllib library

Reposted source 1, cnblogs (python修行路): https://www.cnblogs.com/zhaof/p/6910871.html

Reposted source 2, CSDN, original link: https://blog.csdn.net/jiduochou963/java/article/details/87564467

Official documentation: https://docs.python.org/3/library/urllib.html

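What is urllib

urllib is Python's built-in HTTP request library.
It consists of the following modules:
urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module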

urlopen

The signature of urllib.request.urlopen is:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Using the url parameter

Start with a simple example:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.read().decode('utf-8'))

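Three urlopen parameters are used most often:
urllib.request.urlopen(url, data, timeout)
response.read() returns the content of the page. Without read(), you get the response object itself (an http.client.HTTPResponse instance) rather than the page content.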

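Using the data parameter

The example above fetched Baidu's home page with a GET request. The next one sends a POST request with urllib.
It uses http://httpbin.org/post for the demonstration; httpbin.org is a handy practice site for urllib
because it can simulate all kinds of requests.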

import urllib.parse

import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

print(data)

response = urllib.request.urlopen('http://httpbin.org/post', data=data)

print(response.read())

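This is where urllib.parse comes in: bytes(urllib.parse.urlencode(...)) converts the POST data into bytes that can be passed as the data argument of urllib.request.urlopen, which completes a POST request.
In short, when the data argument is supplied the request is sent as POST; without it the request is sent as GET.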
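As a minimal sketch of the rule above, the snippet below builds Request objects without sending them and asks each one which HTTP method it would use. The form field comes from the example above; nothing is sent over the network.

```python
import urllib.parse
import urllib.request

# Encode a form dictionary into the bytes that the data argument expects.
form = {'word': 'hello'}
data = urllib.parse.urlencode(form).encode('utf-8')

# Request objects can be inspected without sending anything.
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(data)                   # b'word=hello'
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```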

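Using the timeout parameter
When the network is poor or the server misbehaves, a request can be slow or fail outright, so it is worth setting a timeout
on the request instead of letting the program wait indefinitely. For example: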

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)

print(response.read())

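This returns a normal result. If the timeout is then lowered to 0.1,
the request is likely to time out and the program raises an exception.

So the exception needs to be caught; the code becomes: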

import socket

import urllib.request

import urllib.error

try:

    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)

except urllib.error.URLError as e:

    if isinstance(e.reason, socket.timeout):

        print('TIME OUT')
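A timeout does not always arrive wrapped in URLError: one that fires while the response is being read can surface as a bare socket.timeout. The sketch below, with the hypothetical helpers is_timeout and fetch, shows one way to treat both cases alike. It is an illustration under those assumptions rather than the only way to do it.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, urllib.error.URLError):
        # Timeout while connecting: wrapped in URLError.reason.
        return isinstance(exc.reason, socket.timeout)
    # Timeout while reading the response: raised unwrapped.
    return isinstance(exc, socket.timeout)


def fetch(url, timeout):
    """Hypothetical helper: return the body, or None on timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:  # URLError and socket.timeout are both OSError
        if is_timeout(exc):
            return None
        raise
```

A call such as fetch('http://httpbin.org/get', timeout=0.1) would then return None on a timeout and re-raise any other error.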

Response

Response type, status code, and response headers

import urllib.request

response = urllib.request.urlopen('https://www.python.org')

print(type(response))

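The result is: <class 'http.client.HTTPResponse'>
The status code and header information are available through response.status, response.getheaders(), and response.getheader("server").
response.read() returns the content of the response body.

Passing a bare URL to urlopen as above only covers simple requests, because it offers no way to add header information. Crawlers often need to send extra headers to reach the target site, and that is what urllib.request.Request is for.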

request

Setting headers
To keep crawlers from overloading them, many sites only respond to requests that carry certain headers; the most common one is User-Agent.

A simple example:

import urllib.request

request = urllib.request.Request('https://python.org')

response = urllib.request.urlopen(request)

print(response.read().decode('utf-8'))

Adding headers to the request lets you customize the header information sent to the site:

from urllib import request, parse

url = 'http://httpbin.org/post'

headers = {

    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',

    'Host': 'httpbin.org'

}

form = {

    'name': 'zhaofan'

}

data = bytes(parse.urlencode(form), encoding='utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

A second way to add request headers

from urllib import request, parse

url = 'http://httpbin.org/post'

form = {

    'name': 'Germey'

}

data = bytes(parse.urlencode(form), encoding='utf8')

req = request.Request(url=url, data=data, method='POST')

req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

response = request.urlopen(req)

print(response.read().decode('utf-8'))

One advantage of this approach is that you can define your own dictionary of request headers and add them in a loop.
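The loop mentioned above might look like the following sketch. The header values are placeholders chosen for illustration, and the request is only built, not sent.

```python
from urllib import request

# Hypothetical header values, for illustration only.
custom_headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Accept-Language': 'en-US',
}

req = request.Request('http://httpbin.org/get')
for name, value in custom_headers.items():
    req.add_header(name, value)

# Request stores header names capitalized, e.g. 'User-agent'.
print(req.header_items())
```

Note that Request normalizes header names with str.capitalize(), so lookups through get_header() use 'User-agent' rather than 'User-Agent'.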

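Advanced usage: handlers

Proxies: ProxyHandler

urllib.request.ProxyHandler() sets up a proxy. Sites often track how many requests an IP address makes within a period of time and block it when there are too many, so crawling through a proxy is sometimes necessary.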

import urllib.request

proxy_handler = urllib.request.ProxyHandler({

    'http': 'http://127.0.0.1:9743',

    'https': 'https://127.0.0.1:9743'

})

opener = urllib.request.build_opener(proxy_handler)

response = opener.open('http://httpbin.org/get')

print(response.read())

Cookies: HTTPCookieProcessor

Cookies commonly hold login information, and some sites have to be crawled with cookies attached. http.cookiejar is used here to obtain and store cookies.

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

for item in cookie:

    print(item.name+"="+item.value)

Cookies can also be saved to a file. There are two formats, http.cookiejar.MozillaCookieJar and http.cookiejar.LWPCookieJar; either one works.

Code examples follow.
The http.cookiejar.MozillaCookieJar() approach:

import http.cookiejar, urllib.request

filename = "cookie.txt"

cookie = http.cookiejar.MozillaCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True, ignore_expires=True)

The http.cookiejar.LWPCookieJar() approach:

import http.cookiejar, urllib.request

filename = 'cookie.txt'

cookie = http.cookiejar.LWPCookieJar(filename)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

cookie.save(ignore_discard=True, ignore_expires=True)

Likewise, cookies stored in a file can be read back with load(). Read the file with the same class that wrote it.

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()

cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)

opener = urllib.request.build_opener(handler)

response = opener.open('http://www.baidu.com')

print(response.read().decode('utf-8'))
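To see the save and load round trip without contacting a server, the sketch below puts a hand-built cookie into an LWPCookieJar, writes it to a temporary file, and loads it into a fresh jar. The helper make_cookie, the cookie name and value, and the domain example.com are all assumptions made for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Hypothetical helper: build a simple session cookie by hand."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

# Save: session cookies are only written when ignore_discard=True.
saved = http.cookiejar.LWPCookieJar(path)
saved.set_cookie(make_cookie('session', 'abc123', 'example.com'))
saved.save(ignore_discard=True, ignore_expires=True)

# Load with the same class that wrote the file.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in loaded]
print(pairs)
```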

Exception handling

When a program requests pages, some of them will fail with errors such as 404 or 500.
These exceptions need to be caught. A simple example first:

from urllib import request,error

try:

    response = request.urlopen("http://pythonsite.com/1111.html")

except error.URLError as e:

    print(e.reason)

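The code above requests a page that does not exist; catching the exception lets us print the error.

The urllib.error module has two main exceptions to know about here:
URLError and HTTPError, where HTTPError is a subclass of URLError.

URLError has a single attribute, reason, so catching it only gives you the error message, as in the example above.

HTTPError has three attributes, code, reason, and headers, so catching it gives you all three. For example: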

from urllib import request,error

try:

    response = request.urlopen("http://pythonsite.com/1111.html")

except error.HTTPError as e:

    print(e.reason)

    print(e.code)

    print(e.headers)

except error.URLError as e:

    print(e.reason)

else:

    print("request successful")
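Because HTTPError is a subclass of URLError, the more specific exception has to be checked first, as in the except ordering above. The sketch below shows the same idea with a hypothetical describe helper and exceptions constructed by hand, so it runs without network access.

```python
import io
from urllib import error


def describe(exc):
    """Summarize a urllib exception, checking the subclass first."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'other: %r' % (exc,)


# Exceptions built by hand for illustration; normally urlopen raises them.
http_exc = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', {}, io.BytesIO())
url_exc = error.URLError('name resolution failed')

print(describe(http_exc))
print(describe(url_exc))
```

If the URLError check came first, it would also match HTTPError and the status code would never be reported.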

e.reason can also be inspected more closely. For example:

import socket

from urllib import error,request

try:

    response = request.urlopen("http://www.pythonsite.com/",timeout=0.001)

except error.URLError as e:

    print(type(e.reason))

    if isinstance(e.reason,socket.timeout):

        print("time out")

URL parsing

urlparse
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Feature 1:

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")

print(result)

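The result is:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

urlparse splits the URL you pass in into its components.
You can also specify a default scheme:
result = urlparse("www.baidu.com/index.html;user?id=5#comment",scheme="https")
The scheme field of the result is then the one you specified. If the URL already includes a scheme, the scheme argument has no effect.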
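The sketch below illustrates how the scheme argument behaves, including one detail that is easy to miss: a URL without a leading // is parsed as a path, so netloc stays empty even when a default scheme is applied. The shorter URLs are variations on the example above.

```python
from urllib.parse import urlparse

full = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# No scheme and no '//': the host name ends up in path, not netloc.
bare = urlparse('www.baidu.com/index.html', scheme='https')

# A leading '//' marks the network location, so netloc is filled in.
slashed = urlparse('//www.baidu.com/index.html', scheme='https')

# A scheme already present in the URL wins over the default.
explicit = urlparse('http://www.baidu.com', scheme='https')

print(full.params, full.query, full.fragment)  # user id=5 comment
print(bare.scheme, repr(bare.netloc), bare.path)
print(slashed.scheme, slashed.netloc, slashed.path)
print(explicit.scheme)  # http
```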

urlunparse

It does the opposite of urlparse: it assembles components into a URL. For example:

from urllib.parse import urlunparse

data = ['http','www.baidu.com','index.html','user','a=123','commit']

print(urlunparse(data))

The result is:
http://www.baidu.com/index.html;user?a=123#commit

urljoin

urljoin also combines URLs: it resolves a second URL against a base URL. For example:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))

print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))

print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))

print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html?question=2'))

print(urljoin('http://www.baidu.com?wd=abc', 'https://pythonsite.com/index.php'))

print(urljoin('http://www.baidu.com', '?category=2#comment'))

print(urljoin('www.baidu.com', '?category=2#comment'))

print(urljoin('www.baidu.com#comment', '?category=2'))

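The results are:
http://www.baidu.com/FAQ.html
https://pythonsite.com/FAQ.html
https://pythonsite.com/FAQ.html
https://pythonsite.com/FAQ.html?question=2
https://pythonsite.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

The results show that the second URL takes precedence: any component it supplies overrides the base URL's, and the base URL only fills in what the second URL lacks.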

urlencode
urlencode converts a dictionary into URL query parameters. For example:

from urllib.parse import urlencode

params = {

    "name":"zhaofan",

    "age":23,

}

base_url = "http://www.baidu.com?"

url = base_url+urlencode(params)

print(url)

The result is:
http://www.baidu.com?name=zhaofan&age=23
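As a complement, the sketch below reverses the encoding with parse_qs and shows the doseq flag for list values. The tag parameter is a made-up example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

params = {'name': 'zhaofan', 'age': 23}
url = 'http://www.baidu.com?' + urlencode(params)

# parse_qs decodes a query string; every value comes back as a list of str.
decoded = parse_qs(urlsplit(url).query)

# doseq=True expands a list value into repeated key=value pairs.
multi = urlencode({'tag': ['a', 'b']}, doseq=True)

print(url)
print(decoded)
print(multi)
```

Note that the integer 23 comes back as the string '23': query strings carry no type information.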

Module 1: urllib.request

In urllib, the request module is mainly responsible for building and sending network requests, including adding headers, proxies, and so on.

1. Sending a GET request

from urllib import request

resp = request.urlopen('http://www.baidu.com')

print(resp.read().decode('UTF-8'))

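When urlopen() is given a URL as a string, it requests that address and returns the result.

The return value is an http.client.HTTPResponse object, and its read() method returns the data received from the page. Note that the data is bytes, so it has to be decoded with decode() to get a string.

2. Sending a `POST` request

urlopen() sends a GET request by default; passing the data argument makes it send a POST request instead. Note that data must be bytes. The timeout argument sets a time limit in seconds, and an exception is raised if the request takes longer.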

from urllib import request

resp = request.urlopen('http://httpbin.org/post', data=b'word=hello', timeout=10)

print(resp.read().decode())

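3. Adding headers
Requests sent by urllib carry a default header of the form "User-Agent": "Python-urllib/3.6" (the version number follows the running interpreter), which identifies the request as coming from urllib.
So for sites that check the User-Agent, we need to supply our own headers, which is done with the Request object in urllib.request.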

from urllib import request

url = 'http://httpbin.org/get'

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}

# Build a Request object from the url and headers, then pass it to urlopen

req = request.Request(url, headers=headers)

resp = request.urlopen(req)

print(resp.read().decode())

4. The Request object

As shown above, urlopen() accepts not only a URL string but also a Request object, which adds more options. The Request class looks like this:

class urllib.request.Request(url, data=None, headers={},

                             origin_req_host=None,

                             unverifiable=False, method=None)

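Constructing a Request requires the url argument; data and headers are optional.
Finally, the method argument lets you choose the HTTP method freely, such as PUT or DELETE. When method is omitted, the request is GET if data is None and POST otherwise.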
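The following sketch checks how method interacts with data by building Request objects and calling get_method() on them. The URL is only a placeholder and no request is sent.

```python
from urllib import request

url = 'http://httpbin.org/anything'  # placeholder URL, never contacted here

default_get = request.Request(url)
default_post = request.Request(url, data=b'k=v')
explicit_put = request.Request(url, data=b'k=v', method='PUT')
explicit_delete = request.Request(url, method='DELETE')

for req in (default_get, default_post, explicit_put, explicit_delete):
    # get_method() reports the HTTP method the request would use.
    print(req.get_method(), req.full_url, req.data)
```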

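5. Adding cookies

To send cookie information with a request, we need to build a custom opener.
request.build_opener constructs the opener with the cookie handling we want configured in it, and the opener's open method then sends the request.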

from http import cookiejar

from urllib import request

url = 'https://www.baidu.com'

# Create a CookieJar object

cookie = cookiejar.CookieJar()

# Create a cookie processor with HTTPCookieProcessor

cookies = request.HTTPCookieProcessor(cookie)

# Build an opener with it

opener = request.build_opener(cookies)

# Send the request with this opener

resp = opener.open(url)

# The CookieJar now holds the cookies received from Baidu

for i in cookie:

    print(i)

Alternatively, install_opener can make this opener the global default, so that every later urlopen call carries the cookies too:

# Make this opener the global opener

request.install_opener(opener)

resp = request.urlopen(url)

6. Setting a proxy

When crawling data, a proxy is often used to hide our real IP address.

from urllib import request

url = 'http://httpbin.org/ip'

proxy = {'http':'218.18.232.26:80','https':'218.18.232.26:80'}

# Create the proxy handler

proxies = request.ProxyHandler(proxy)

# Create the opener object

opener = request.build_opener(proxies)

resp = opener.open(url)

print(resp.read().decode())

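7. Downloading data to a local file

Network requests often involve saving images, audio, or other data locally. One way is to use Python's file operations and write the data returned by read() to a file.
urllib also provides urlretrieve(), which saves the fetched data straight to a file.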

from urllib import request

url = 'http://python.org/'

# The second argument to urlretrieve() is the location and name of the file to save.

request.urlretrieve(url, 'python.html')
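The file-operations route mentioned above can be sketched with urlopen and shutil.copyfileobj, which streams the body to disk instead of holding it all in memory. The download helper is hypothetical, and the demonstration uses a local file:// URL so that it runs without network access; an http:// URL is passed the same way.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream the response body for url into the file dest."""
    with urllib.request.urlopen(url) as response, open(dest, 'wb') as out:
        shutil.copyfileobj(response, out)
    return dest


# Demonstrate with a local file:// URL so no network access is needed.
workdir = Path(tempfile.mkdtemp())
source = workdir / 'source.bin'
source.write_bytes(b'example payload')

target = download(source.as_uri(), workdir / 'copy.bin')
print(target.read_bytes())
```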

Note: urlretrieve() was ported directly from Python 2.x and may be deprecated in a future version.

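Module 2: urllib.response

A request sent with urlopen() or an opener's open() method returns a response object. It has several methods and attributes for working with the result.

read(): returns the response data; the body can be read only once.

getcode(): returns the status code sent by the server.

getheaders(): returns the response headers.

geturl(): returns the URL that was actually retrieved.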
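To try these methods without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and queries it. HelloHandler and its fixed response are assumptions made for the demonstration, and the opener is built with an empty ProxyHandler so that any proxy configured in the environment is bypassed. Newer Python versions also expose the status code and URL as the status and url attributes.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with b'hello'."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), HelloHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

# An empty ProxyHandler disables proxies for this opener.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as resp:
    status = resp.getcode()
    content_type = resp.getheader('Content-Type')
    final_url = resp.geturl()
    first = resp.read()
    second = resp.read()  # the body is already consumed

server.shutdown()
server.server_close()

print(status, content_type, final_url, first, second)
```

The second read() returns an empty bytes object, which is why the body should be stored in a variable if it is needed more than once.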

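Module 3: urllib.parse

urllib.parse is the urllib module for parsing and encoding URL data.

1. urllib.parse.quote

A URL may only contain ASCII characters, so special characters outside ASCII, including Chinese characters, cannot be used in it directly. Sometimes we still need to put Chinese text into a URL, as in the Baidu search address https://www.baidu.com/s?wd=南北, where the wd parameter after the ? is the search keyword. The solution is to URL-encode the special characters into a form a URL can carry, and urllib provides quote() for this.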

 >>> from urllib import parse

 >>> keyword = '南北'

 >>> parse.quote(keyword)

 '%E5%8D%97%E5%8C%97'

To convert encoded data back, use unquote().

>>> parse.unquote('%E5%8D%97%E5%8C%97')

'南北'
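quote() leaves / unescaped by default, which matters when the text being encoded is a query value rather than a path. The sketch below compares the default with safe='' and with quote_plus(); the sample string 'a b/c' is made up for illustration.

```python
from urllib import parse

keyword = '南北'
encoded = parse.quote(keyword)

# By default '/' is treated as safe and left alone.
path_like = parse.quote('a b/c')
# safe='' escapes '/' as well.
strict = parse.quote('a b/c', safe='')
# quote_plus is meant for query strings: spaces become '+'.
form_style = parse.quote_plus('a b/c')

search_url = 'https://www.baidu.com/s?wd=' + encoded

print(encoded, path_like, strict, form_style)
print(search_url)
```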

2. `urllib.parse.urlencode`

A request often needs many URL parameters, and assembling them by string concatenation is tedious, so urllib provides urlencode to build the query string.

 >>> from urllib import parse

 >>> params = {'wd': '南北', 'code': '1', 'height': '188'}

 >>> parse.urlencode(params)

 'wd=%E5%8D%97%E5%8C%97&code=1&height=188'

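Module 4: urllib.error

urllib mainly defines two exceptions, URLError and HTTPError, and HTTPError is a subclass of URLError.

HTTPError also has three attributes:

code: the HTTP status code of the response
reason: the reason for the error
headers: the response headers
Example: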

In [1]: from urllib import request
   ...: from urllib.error import HTTPError

In [2]: try:
   ...:     request.urlopen('https://www.jianshu.com')
   ...: except HTTPError as e:
   ...:     print(e.code)

403

————————————————
Copyright notice: this is an original article by CSDN blogger 「IoneFine」, released under the CC 4.0 BY-SA license.
Original link: https://blog.csdn.net/jiduochou963/java/article/details/87564467
