Python Crawlers: A Roundup of Pitfalls

Pitfall 1: URLError

  Problem: urllib.error.URLError: <urlopen error [Errno 11004] getaddrinfo failed>

import urllib.request


def load_message():
    url = 'http://www.baidu.com'
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    response_str = response.read().decode('utf-8')
    return response.headers, request.headers, response_str


response_header, request_header, response_data = load_message()
print(request_header)
print('----------------------------------------')
print(response_header)
print('----------------------------------------')
print(response_data)

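  Analysis: URLError is raised when the resource behind the URL cannot be reached. The "getaddrinfo failed" message means the host name could not be resolved to an address. The cause lies in one of three places: the URL itself, the client, or the server.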

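  Solution: first, check that the URL is spelled correctly; second, check the client's network connection; third, open the URL in a browser's address bar to confirm that the server exists.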
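  The three checks above can also be scripted. The following is a minimal sketch rather than a definitive tool: the function name diagnose and its messages are hypothetical, and it assumes that resolving the host name separately is an acceptable way to tell a DNS failure apart from a connection or server problem.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def diagnose(url, timeout=5):
    """Return a short message saying where fetching `url` goes wrong (hypothetical helper)."""
    # Check 1: is the URL itself well formed enough to contain a host name?
    host = urlparse(url).hostname
    if not host:
        return 'bad URL: no host name found'
    # Check 2: can the client resolve the host name (the getaddrinfo step)?
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror as exc:
        return f'DNS lookup failed for {host}: {exc}'
    # Check 3: does the server answer?
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f'OK: HTTP {response.status}'
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return f'server answered with HTTP {exc.code}'
    except urllib.error.URLError as exc:
        return f'could not connect: {exc.reason}'
```

  Calling print(diagnose('http://www.baidu.com')) reports which of the three stages failed, if any.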

  Problem: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>

#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Author: Snow
import urllib.request


def create_cookie():
    url = 'https://www.yaozh.com/member/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/69.0.3497.92 Safari/537.36',
        'Cookie': 'think_language=zh-CN; _ga=GA1.2.179792116.1550119571; _gat=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; _gid=GA1.2.111857803.1550119581; yaozh_logintime=1550119751; yaozh_user=692948%09snown_1; yaozh_userId=692948; yaozh_uidhas=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; MEIQIA_VISIT_ID=1H9g97Ef1WpjYsWf4b7UlGe3wel; PHPSESSID=5itl5rejqnekb07bfrtmuvr3l6; yaozh_mylogin=1550196658; MEIQIA_VISIT_ID=1HCCOYdyjR0FalzMfFm4vYsqevT; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1550119570%2C1550119584%2C1550119751%2C1550196659; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1550196663'
    }
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    response_data = response.read().decode('utf-8')
    return response_data


result = create_cookie()
with open('cookies.html', 'w', encoding='utf-8') as f:
    f.write(result)

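  Analysis: when Python opens an https link with urllib.request.urlopen(), it verifies the server's SSL certificate. If the certificate cannot be verified against the locally trusted certificate authorities (for example, because the site uses a self-signed certificate), an exception is raised.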

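  Solution: first, create a context with the ssl module and pass it to urlopen() through its context parameter; second, turn off certificate verification by default. Both listings below skip certificate verification.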

  First approach, passing a context to urlopen():

#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Author: Snow
import urllib.request
import ssl  # import the ssl module


def create_cookie():
    url = 'https://www.yaozh.com/member/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/69.0.3497.92 Safari/537.36',
        'Cookie': 'think_language=zh-CN; _ga=GA1.2.179792116.1550119571; _gat=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; _gid=GA1.2.111857803.1550119581; yaozh_logintime=1550119751; yaozh_user=692948%09snown_1; yaozh_userId=692948; yaozh_uidhas=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; MEIQIA_VISIT_ID=1H9g97Ef1WpjYsWf4b7UlGe3wel; PHPSESSID=5itl5rejqnekb07bfrtmuvr3l6; yaozh_mylogin=1550196658; MEIQIA_VISIT_ID=1HCCOYdyjR0FalzMfFm4vYsqevT; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1550119570%2C1550119584%2C1550119751%2C1550196659; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1550196663'
    }
    context = ssl._create_unverified_context()  # create an SSL context that skips verification
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request, context=context)  # pass the context parameter
    response_data = response.read().decode('utf-8')
    return response_data


result = create_cookie()
with open('cookies.html', 'w', encoding='utf-8') as f:
    f.write(result)

  Second approach, turning off verification by default:

#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Author: Snow
import urllib.request
import ssl


def create_cookie():
    url = 'https://www.yaozh.com/member/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/69.0.3497.92 Safari/537.36',
        'Cookie': 'think_language=zh-CN; _ga=GA1.2.179792116.1550119571; _gat=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; _gid=GA1.2.111857803.1550119581; yaozh_logintime=1550119751; yaozh_user=692948%09snown_1; yaozh_userId=692948; yaozh_uidhas=1; acw_tc=2f624a2115501195808648935e4f2de7e89205315a7c9e8934c938389d8999; MEIQIA_VISIT_ID=1H9g97Ef1WpjYsWf4b7UlGe3wel; PHPSESSID=5itl5rejqnekb07bfrtmuvr3l6; yaozh_mylogin=1550196658; MEIQIA_VISIT_ID=1HCCOYdyjR0FalzMfFm4vYsqevT; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1550119570%2C1550119584%2C1550119751%2C1550196659; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1550196663'
    }
    # When no context argument is given, use an unverified context: certificate checks are skipped
    ssl._create_default_https_context = ssl._create_unverified_context
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    response_data = response.read().decode('utf-8')
    return response_data


result = create_cookie()
with open('cookies.html', 'w', encoding='utf-8') as f:
    f.write(result)
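
  Both workarounds above switch certificate verification off. Where that is not acceptable, one possible alternative is to keep verification on and point Python at a trusted CA bundle. The sketch below assumes such a bundle is available; the function names and the cafile argument value are hypothetical, and whether this resolves the error depends on why the local issuer certificate was missing in the first place.

```python
import ssl
import urllib.request


def build_verified_context(cafile=None):
    """Create an SSL context that still verifies certificates.

    `cafile` is a hypothetical path to a PEM bundle of trusted CAs;
    None falls back to the system's default trust store.
    """
    return ssl.create_default_context(cafile=cafile)


def fetch(url, context, timeout=10):
    """Fetch `url` over HTTPS using the given SSL context."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

  A context built this way requires a valid certificate and a matching host name, whereas ssl._create_unverified_context() requires neither.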

Pitfall 2: HTTPError

  Problem: urllib.error.HTTPError: HTTP Error 503: Service Temporarily Unavailable

#!/usr/bin/env python
# -*- coding=utf-8 -*-
# Author: Snow
import urllib.request


def fee_proxy():
    url = 'https://www.xicidaili.com/nn/'
    # Paid proxy IP, first approach
    # proxy_1 = {
    #     'http': 'user_name:password@121.61.1.222:9999'
    # }
    # Paid proxy IP, second approach
    user_name = 'admin'
    password = ''
    proxy_ip = '121.61.1.222:9999'
    proxy_manage = urllib.request.HTTPPasswordMgrWithDefaultRealm()  # password manager
    proxy_manage.add_password(None, proxy_ip, user_name, password)
    # proxy_handler = urllib.request.ProxyHandler(proxy_1)
    # Proxy authentication handler. Note: it only answers proxy authentication
    # challenges (HTTP 407); it does not by itself route requests through proxy_ip.
    proxy_handler = urllib.request.ProxyBasicAuthHandler(proxy_manage)
    proxy_opener = urllib.request.build_opener(proxy_handler)
    response = proxy_opener.open(url)
    response_str = response.read().decode('utf-8')
    return response_str


data = fee_proxy()
print(data)

  Analysis

  Solution
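
  The cause of this particular 503 is not established here. In general, a 503 status means the server is temporarily unable to handle the request, for example because it is overloaded or is throttling the client, so one common mitigation is to wait and retry. The following is a minimal sketch under that assumption, not a confirmed fix for this site; the function name open_with_retry and its parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request


def open_with_retry(url, headers=None, retries=3, backoff=1.0, opener=None):
    """Open `url`, retrying on HTTP 503 with exponential backoff (hypothetical helper).

    `opener` may be an object built with urllib.request.build_opener();
    when omitted, urllib.request.urlopen is used.
    """
    open_func = opener.open if opener else urllib.request.urlopen
    request = urllib.request.Request(url, headers=headers or {})
    for attempt in range(retries + 1):
        try:
            return open_func(request)
        except urllib.error.HTTPError as exc:
            # Give up on any status other than 503, or once the retries are used up.
            if exc.code != 503 or attempt == retries:
                raise
            # Wait backoff, 2 * backoff, 4 * backoff, ... seconds between attempts.
            time.sleep(backoff * 2 ** attempt)
```

  Passing browser-like headers through the headers argument, as the earlier listings do with User-Agent, may also be worth trying, since some sites treat requests without them differently.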
