What should you do when your Python crawler's IP gets blocked?

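Programmers who scrape data regularly need more than clean, convenient code; they also have to deal with a site's anti-scraping measures. Sometimes a crawl that was running fine suddenly starts raising errors, or the target site starts returning error codes.

Typical examples are a 403 Forbidden response, a "your IP is making requests too frequently" message, or a CAPTCHA that lifts the block once solved, only for the same thing to happen again a little later.

All of these mean the crawler has triggered the site's anti-scraping mechanism: the server has seen the number of requests from a single IP exceed its configured threshold and has switched on verification automatically. Put plainly, the IP has been blocked. It may be unblocked after a few hours, but a crawler cannot afford to wait.

This is where proxy IPs come in. Proxy software, paid proxies, and ADSL dial-up proxies can all rescue a crawler from IP bans.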
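The block symptoms described above (a 403 response, a frequency warning, a CAPTCHA page) can be checked for in code, so the crawler knows when it is time to switch proxies. The following is a minimal sketch, not a definitive detector: the function name `looks_blocked`, the status code set, and the marker strings are all assumptions chosen for illustration, and a real site will need its own markers.

```python
# Status codes that commonly signal a ban or rate limit (an assumed set).
BLOCK_STATUS_CODES = {403, 429}

# Hypothetical substrings that a block page might contain, in lower case.
BLOCK_MARKERS = ("captcha", "too frequent")


def looks_blocked(status_code, body):
    """Return True if a response looks like an anti-scraping block."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

A crawler could call `looks_blocked(response.status_code, response.text)` after each request and rotate to another proxy when it returns True.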

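Site for testing proxy requests and responses: http://httpbin.org/get.

httpbin echoes back all kinds of information about an HTTP request and its response, such as cookies, IP, headers, and login authentication.

It supports GET, POST, and other methods, which makes it very useful for web development and testing.

It is written in Python with Flask and is an open-source project.

The origin field in the returned data is the client's IP address, so it shows whether the IP has been masked successfully.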
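The origin check can be automated. The sketch below assumes the response body is JSON with an `origin` field, as httpbin.org/get returns, and that the field may hold several comma-separated addresses when the request passed through intermediaries. The helper names `origin_ips` and `is_masked` are hypothetical, and the addresses in the usage note are documentation-range placeholders.

```python
import json


def origin_ips(body):
    """Return the IP addresses listed in the "origin" field of a JSON body."""
    origin = json.loads(body).get("origin", "")
    return [ip.strip() for ip in origin.split(",") if ip.strip()]


def is_masked(real_ip, body):
    """Return True if the machine's real IP does not appear in "origin"."""
    return real_ip not in origin_ips(body)
```

For example, with a real IP of 198.51.100.4, a body of `{"origin": "203.0.113.7"}` counts as masked, while a body that still lists 198.51.100.4 does not.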

Setting up a proxy:

1. Setting a proxy with urllib

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = 'ip_address:port'
# Proxy that requires authentication
# proxy = 'username:password@ip_address:port'

# Configure the proxy with ProxyHandler.
# Assuming a plain HTTP proxy, both entries point at an http:// proxy URL.
proxy_handler = ProxyHandler({
    'http': 'http://' + proxy,
    'https': 'http://' + proxy
})

# Build an Opener object from the handler
opener = build_opener(proxy_handler)

try:
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
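The authenticated form `username:password@ip_address:port` breaks if the password itself contains characters such as `@` or `:`. One way around this, sketched here with a hypothetical helper `build_proxy_url`, is to percent-encode the credentials before assembling the proxy URL. The host, port, and credentials in the usage note are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None, scheme="http"):
    """Assemble a proxy URL, percent-encoding any credentials."""
    if username is None:
        return f"{scheme}://{host}:{port}"
    # safe="" makes quote() escape "@", ":" and "/" as well
    user = quote(username, safe="")
    pwd = quote(password or "", safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"
```

The result can be used directly as a value in the proxy dictionary, for example `{'http': build_proxy_url('10.0.0.1', 8080, 'user', 'p@ss:word')}`.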

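2. Setting a proxy with requests

import requests

proxy = 'ip_address:port'
# Proxy that requires authentication
# proxy = 'username:password@ip_address:port'

# Assuming a plain HTTP proxy: the scheme in each value says how to talk to
# the proxy itself, so https traffic also goes through an http:// proxy URL.
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}

try:
    # A timeout keeps a dead proxy from hanging the crawler
    response = requests.get('http://httpbin.org/get', proxies=proxies, timeout=10)
    print(response.text)
except requests.exceptions.RequestException as e:
    print('Error', e.args)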

3. Using a proxy with Selenium

With PhantomJS

from selenium import webdriver

# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs Selenium 3 or older
service_args = [
    '--proxy=ip_address:port',
    '--proxy-type=http',
    # '--proxy-auth=username:password'  # proxy with authentication
]

browser = webdriver.PhantomJS(service_args=service_args)
browser.get('http://httpbin.org/get')
print(browser.page_source)

With Chrome

from selenium import webdriver

proxy = 'ip_address:port'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + proxy)

# Pass the options object via options=; chrome_options= is the deprecated keyword
chrome = webdriver.Chrome(options=chrome_options)
chrome.get('http://httpbin.org/get')

4. Using a proxy in Scrapy

# Inside a Scrapy Downloader Middleware
    ...
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://ip_address:port'
    ...
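For context, the fragment above could sit inside a middleware class along the following lines. This is only a sketch under stated assumptions: the class name `RandomProxyMiddleware` and the `PROXY_LIST` setting are hypothetical names introduced here, not part of Scrapy, and the class picks a random proxy per request rather than the single fixed proxy shown above.

```python
import random


class RandomProxyMiddleware:
    """Downloader middleware sketch that assigns a random proxy per request."""

    def __init__(self, proxy_list):
        self.proxy_list = list(proxy_list)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting holding 'ip:port' strings
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen for them
        if self.proxy_list and "proxy" not in request.meta:
            request.meta["proxy"] = "http://" + random.choice(self.proxy_list)
        return None  # None tells Scrapy to keep processing the request
```

To take effect, a class like this would need to be registered under `DOWNLOADER_MIDDLEWARES` in the project's settings.py.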

Using free proxies

import random

import requests

# Define the proxy pool
proxy_list = [
    '182.39.6.245:38634',
    '115.210.181.31:34301',
    '123.161.152.38:23201',
    '222.85.5.187:26675',
    '123.161.152.31:23127',
]

# Free IP link: http://jshk.com.cn/mb/reg.asp?kefu=xjy

# Pick a proxy at random
proxy = random.choice(proxy_list)

# Assuming plain HTTP proxies, both entries use an http:// proxy URL
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}

try:
    response = requests.get('http://httpbin.org/get', proxies=proxies, timeout=10)
    print(response.text)
except requests.exceptions.RequestException as e:
    print('Error', e.args)
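Free proxies fail often, so picking one at random and giving up on the first error wastes the rest of the pool. A possible refinement, sketched below, is to try several proxies in turn. The function `fetch_with_rotation` and its `fetch` callback are hypothetical; the callback is injected so the retry logic does not depend on any particular HTTP library, and plain HTTP proxies are assumed.

```python
import random


def fetch_with_rotation(url, proxy_list, fetch, max_attempts=3):
    """Try up to max_attempts distinct proxies and return the first success.

    fetch(url, proxies) performs the request and raises OSError (or a
    subclass, as requests and urllib errors are) when the proxy fails.
    """
    attempts = min(max_attempts, len(proxy_list))
    last_error = None
    for proxy in random.sample(proxy_list, k=attempts):
        proxies = {"http": "http://" + proxy, "https": "http://" + proxy}
        try:
            return fetch(url, proxies)
        except OSError as exc:
            last_error = exc  # remember the failure and move on
    raise RuntimeError("all proxies failed") from last_error
```

With requests, the callback could be `lambda url, proxies: requests.get(url, proxies=proxies, timeout=10).text`.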

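Using a proxy from a proxy service in requests

import requests

# Fetch a proxy from the proxy service. The response is assumed to be a bare
# 'ip:port' string, so surrounding whitespace is stripped.
proxy = requests.get("http://jshk.com.cn", timeout=10).text.strip()

# Assuming a plain HTTP proxy, both entries use an http:// proxy URL
proxies = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}

try:
    response = requests.get('http://httpbin.org/get', proxies=proxies, timeout=10)
    print(response.text)
except requests.exceptions.RequestException as e:
    print('Error', e.args)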
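Whatever a proxy service returns is worth validating before it goes into the proxy dictionary, since an error page or an empty body would otherwise produce a broken proxy URL. The sketch below assumes the service is expected to return a single IPv4 `ip:port` pair as plain text; the helper name `parse_proxy` is hypothetical.

```python
import ipaddress


def parse_proxy(text):
    """Validate an 'ip:port' string and return (ip, port).

    Raises ValueError if the text is not an IPv4 address plus a valid port.
    """
    candidate = text.strip()
    host, sep, port = candidate.rpartition(":")
    if not sep:
        raise ValueError(f"no port separator in {candidate!r}")
    ipaddress.IPv4Address(host)  # raises a ValueError subclass if malformed
    port_number = int(port)      # raises ValueError if not numeric
    if not 0 < port_number < 65536:
        raise ValueError(f"port out of range: {port_number}")
    return host, port_number
```

A crawler can call this on the service response and request a fresh proxy whenever ValueError is raised.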
