Python爬虫学习笔记(四)
2024-10-06 19:36:33
Request:
Test1(基本属性:POST):
代码1:
import requests # 发送POST请求
data = { }
response = requests.post(url, data=data)
POST请求
Test2(auth认证):
代码2:
import requests # 发送POST请求
data = { }
response = requests.post(url, data=data) #内网 => 需要认证
auth = (user, pwd)
response = requests.get(url, auth=auth)
auth认证
Test3(添加代理IP):
代码3:
import requests # 1.请求url
url = 'http://www.baidu.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36', }
free_proxy = {
'http': '58.249.55.222:9797'
}
response = requests.get(url=url, headers=headers, proxies=free_proxy)
print(response.status_code)
添加代理IP
Test4(SSL认证):
代码4:
import requests url = 'https://www.12306.cn/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36', }
response = requests.get(url=url, headers=headers)
data = response.content.decode()
with open('03-ssl.html', 'w', encoding='utf-8')as f:
f.write(data)
错误示例
错误原因:
https是有第三方CA证书认证的
但12306虽然是https,但是他不是CA证书,他是自己颁布的证书
解决方法:
告诉web的服务器忽略证书访问
错误原因及解决方法
正确操作:
import requests url = 'https://www.12306.cn/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36', }
response = requests.get(url=url, headers=headers, verify=False)
data = response.content.decode()
with open('03-ssl.html', 'w', encoding='utf-8')as f:
f.write(data)
忽略证书进行SSL认证
返回:
Test5(Cookies):
以抓取https://www.yaozh.com/为例
代码5:
import requests # 请求数据url
member_url = 'https://www.yaozh.com/mamber'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36', }
# cookies字符串
cookies = 'acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4; PHPSESSID=em18pu35k89loei7glff6tthu5; _ga=GA1.2.1727298837.1614668307; _gid=GA1.2.781607072.1614668307; _gat=1; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1614668307; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1614668313; yaozh_logintime=1614668314; yaozh_user=1038868%09s1mpL3; yaozh_userId=1038868; yaozh_jobstatus=kptta67UcJieW6zKnFSe2JyYnoaSZ5htnZqdg26qb21rg66flM6bh5%2BscZdyVNaWz9Gwl4Ny2G%2BenofNlKqpl6XKppZVnKmflWlxg2lolJeaA7663ea03b06b64c50D898E7C9Aa493SkZSblWmHcNiemZtVq56lloN0pG2SaZ%2BGam2SbGiZnZWSmZyYaodw4g%3D%3Dc0addf8502ad2cb2f15b3ee2b0e1386b; db_w_auth=854448%09s1mpL3; UtzD_f52b_saltkey=oZ3HIEZl; UtzD_f52b_lastvisit=1614664715; UtzD_f52b_lastact=1614668315%09uc.php%09; UtzD_f52b_auth=25786v5G66Jeq3x5p14pj246attR7pdjk5X409nOsIqKeaEypUvH%2BA1LTN4dNrChFQS1hmPQNfLGoyIxAD8PRqgk4bA; yaozh_uidhas=1; yaozh_mylogin=1614668317; acw_tc=2f624a2116146683053267637e08ee667240e1e39bfc4dbbbd0b8a8f33eaa4'
# 但需要的是cookies的字典类型
cookies_dict = { }
cookies_list = cookies.split('; ')
for cookie in cookies_list:
cookies_dict[cookie.split('=')[0]] = cookie.split('=')[1] response = requests.get(url=member_url, headers=headers, cookies=cookies_dict)
data = response.content.decode()
with open('05-cookie.html', 'w', encoding='utf-8')as f:
f.write(data)
Cookies - 代码传参登陆
返回:
由返回页面可知,登录成功。
Test6(cookies - session):
以抓取https://www.yaozh.com/为例
代码6:
Cookies - 代码带着cookie去请求数据
返回:
由返回页面可知,登录成功。
数据解析:
Test1(正则表达式 - 贪婪模式):
代码1:
import re # 贪婪模式:从开头匹配到结尾
# 非贪婪模式:使用?
one = 'mdfjiefhuehfgieufn213431241234n'
pattern = re.compile('m(.*)n')
result = pattern.findall(one)
print(result)
贪婪模式
返回1:
['dfjiefhuehfgieufn213431241234']
Test2(正则表达式 - 费贪婪模式):
代码2:
import re # 贪婪模式:从开头匹配到结尾
# 非贪婪模式:使用?
one = 'mdfjiefhuehfgieufn213431241234n'
pattern = re.compile('m(.*?)n')
result = pattern.findall(one)
print(result)
返回2:
['dfjiefhuehfgieuf']
Test3(正则表达式 - 匹配换行符):
代码3:
import re # .除了换行符之外的匹配
one = """
maiwgdyuwagdwyadg
1234567612121134n
"""
pattern = re.compile('m(.*)n')
result = pattern.findall(one)
print(result)
不匹配换行符
返回3:
[]
代码4:
import re # .除了换行符之外的匹配
one = """
maiwgdyuwagdwyadg
1234567612121134n
"""
pattern = re.compile('m(.*)n', re.S)
result = pattern.findall(one)
print(result)
匹配换行符
返回4:
代码5:
import re # .除了换行符之外的匹配
one = """
maiwgdyuwagdwyadg
1234567612121134N
"""
pattern = re.compile('m(.*)n', re.S | re.I)
result = pattern.findall(one)
print(result)
忽略大小写
返回5:
Test4(正则表达式 - 纯数字的正则):
代码6:
import re # 纯数字的正则 \d 0 - 9之间的一个数
pattern = re.compile('^\d+$')
one = '1234' # 匹配判断的方法
result = pattern.match(one) print(result.group())
返回:
1234
Test5(正则表达式 - 范围运算):
代码7:
import re # 范围运算 [123] [1-9]
one = '798345' pattern = re.compile('[1-9]')
result = pattern.findall(one)
print(result)
返回:
['7', '9', '8', '3', '4', '5']
注:
- match:从头匹配,匹配一次
- search:从任意位置,匹配一次
- findall:查找符合正则的内容
- sub:替换字符串
- split:拆分
最新文章
- 【性能为王】从PHP源码剖析array_keys和array_unique
- 【C#公共帮助类】DateTimeHelper设置电脑本地时间,实际开发很需要
- 【Java每日一题】20161219
- MFC中文件的查找、创建、打开、读写等
- 2.openstack之mitaka搭建控制节点数据库和消息队列
- docker 会这些也够
- Windows2003 II6.0 FTP 开了防火墙 FTP不能正常工作的解决办法
- Linux搭建DNS服务器
- c++new和new()区别(了解)
- POJ 3928 Ping pong
- USB Type-C 连接器规范推出之后,市场很多低质量线材容易损坏设备
- Spring4.0学习笔记(6) —— 通过工厂方法配置Bean
- Fork 一个仓库并同步
- 什么是Git?
- Eclipse 安装反编译插件
- Django进阶篇【2】
- Linux下C/C++程序调试基础(GCC,G++,GDB,CGDB,DDD)
- Linux kernel的中断子系统之(二):IRQ Domain介绍
- SpringBoot集成rabbitmq(一)
- 理解 OAuth2.0
热门文章
- Luogu T14448 区间开方
- 2019HDU多校 Round10
- Consonant Fencity Gym - 101612C 暴力二进制枚举 Intelligence in Perpendicularia Gym - 101612I 思维
- 【noi 2.6_6046】数据包的调度机制(区间DP)
- HDU -1151 二分匹配与有向无环图不相交最小路径覆盖数
- CDN 概述
- Hexo之更换背景及透明度
- 基于OpenCV全景拼接(Python)SIFT/SURF
- (数据科学学习手札107)在Python中利用funct实现链式风格编程
- 如何在 网站页面中插入ppt/pdf 文件,使用插件,Native pdf 支持,chrome,Edge,Firefox,