Large-Scale Web Crawling (Sina as an Example): Wrapping a downloader and a parser (Encoding and Other Details)
2024-09-08 18:17:06
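import traceback

import cchardet
import requests
from lxml import etree


def downloader(url, timeout=10, headers=None, debug=False, binary=False):
    """Fetch url and return (status, html, redirected_url)."""
    _headers = {
        'User-Agent': ('Mozilla/5.0 (compatible; MSIE 9.0; '
                       'Windows NT 6.1; Win64; x64; Trident/5.0)')
    }
    redirected_url = url
    # Use the default User-Agent only when the caller passes no headers.
    if not headers:
        headers = _headers
    try:
        # headers must be passed by keyword: the second positional
        # argument of requests.get is params, not headers.
        res = requests.get(url, headers=headers, timeout=timeout)
        if binary:
            html = res.content
        else:
            # cchardet may report None (e.g. for an empty body); fall back to UTF-8.
            encoding = cchardet.detect(res.content)["encoding"] or "utf-8"
            html = res.content.decode(encoding)
        status = res.status_code
        redirected_url = res.url
    except Exception:
        if debug:
            traceback.print_exc()
        msg = "failed download:{}".format(url)
        print(msg)
        # Return an empty body of the requested type and status 0 on failure.
        html = b"" if binary else ""
        status = 0
    return status, html, redirected_url


def parser(html):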
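    """Print the numbered (href, title) pairs found in the page's link lists."""
    d = 0
    # Nothing to parse when the download failed and html is empty.
    if not html:
        return
    tree = etree.HTML(html)
    divs_list = tree.xpath(".//div[@class = 'main']/div[contains(@class,'clearfix')]")
    for div in divs_list:
        a_list = div.xpath(".//ul[contains(@class,'list-a')]//a")
        for i in a_list:
            try:
                # Remove real newline and tab characters, not the literal text "\n" and "\t".
                href = i.xpath("./@href")[0].strip().replace("\n", '').replace('\t', '')
                title = i.xpath("./text()")[0].strip().replace("\n", '').replace('\t', '')
                d += 1
                print(d, (href, title))
            except IndexError:
                # Skip anchors that lack an href or a text node.
                pass


if __name__ == '__main__':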
    url = r"https://www.sina.com.cn/"
    status, html, redirected_url = downloader(url)
    # parser prints its results and returns None, so there is nothing to assign.
    parser(html)
    # print(status, html, redirected_url)
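
On the encoding detail mentioned in the title: a detector's guess can be missing or wrong, and a strict decode with a bad guess fails the whole download. Below is a minimal sketch of one way to soften that, not part of the original code. The helper name `decode_body` and its `fallbacks` parameter are hypothetical, and the choice of `gb18030` as a fallback is an assumption based on it being a superset of GBK and GB2312, which are common on Chinese sites. It uses only the standard library.

```python
from typing import Iterable, Optional


def decode_body(raw: bytes, detected: Optional[str] = None,
                fallbacks: Iterable[str] = ("utf-8", "gb18030")) -> str:
    """Decode raw bytes, trying the detected encoding first, then the fallbacks.

    `detected` is whatever the detector reported; it may be None.
    (Hypothetical helper for illustration, not from the original post.)
    """
    candidates = ([detected] if detected else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep the text readable and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

Inside downloader, the non-binary branch could then call something like `decode_body(res.content, cchardet.detect(res.content)["encoding"])` in place of the direct `decode` call, assuming the helper is defined in the same module.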