DataWhale-Task4(爬取丁香园2)

任务:使用lxml爬虫帖子相关的回复与部分用户信息(用户名,头像地址,回复详情)

难点:需要登录才能看到所有回复

浏览器登录上去,查看cookies信息,复制,通过request.get()的参数使用标识登录身份的cookies,这样便着请求所回复(直接请求帖子主页的只是html,需要向对应的api发起请求才能看到回帖数据)

    cookies = {}
temp = "DXY_USER_GROUP=49; __auc=f5db5ffc1693f4415a8b9b324af; _ga=GA1.2.406327704.1551544823; _gid=GA1.2.832234072.1551600247; __utma=1.406327704.1551544823.1551575932.1551655682.5; __utmz=1.1551655682.5.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmz=3004402.1551676197.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JUTE_BBS_DATA=59a1c912e729f883a4072343bbd3cf3ce120b0d2a9073d1f98b4b4abb6b976855a9d60443f3910a91a2f4acd5ba5b796cb23f957a410053aaeca64aaa758f7468c5f3b6e4f5e8b3afdde9ab9a36c4e7c599039e6942142f476034f89445921cdfdac46fbcd62e2b2d57ebc2c50c50d8e1b14d314431af16b; __utmc=3004402; JUTE_SESSION_ID=d8ff12d6-4a18-49b7-a793-3cec155e2871; JUTE_TOKEN=364c5b97-0a5e-479b-bcb2-4fa6e665aa55; JSESSIONID=0D8D5058CEC7915AFFE1E95EEB7ECDF1; __utma=3004402.406327704.1551544823.1551693466.1551704245.3; __utmt=1; __utmb=3004402.1.10.1551704245; JUTE_SESSION=e8ecebb9b808ddb678837312dda5b1b477f72176e200de7dd3f4858315fb204c21184bb31cedc24a7c1f7c4dcccee51ab23a4595b8e44787b9fd92479d0a34424ab9ce058850dba8"
for i in temp.split(';'):
li = i.strip().split('=')
cookies[li[0]] = li[1]

完整代码

import requests
import json
import re
from lxml import etree def display(topic):
"""
topic: 字典,键值有 topic comment
topic key 主题名
comment key 相关评论
"""
print("主题:\n", topic['topic'])
print("主题评论:\n")
for item in topic['comment']:
for k, v in item.items():
print(k, '\n')
print('\t头像:', v['avatar'], '\n')
print('\t评论:', v['body'], '\n') def main(url, headers, cookies):
topic = {}
index = 1
resp = requests.get(url, headers=headers, cookies=cookies)
maxpage = resp.json()['pageBean']['total'] # 获取回复的全部页数
topic['topic'] = resp.json()['subject'] # 该帖子的主题
topic['comment'] = [] # 回复列表
while index < maxpage:
target_url = url.format(index)
resp = requests.get(target_url, headers=headers, cookies=cookies)
for item in resp.json()['items']:
d = {
item['nickname']: {'avatar': item['user']['avatar'],
'body': item["body"]}
}
topic['comment'].append(d)
index += 1
display(topic) if __name__ == '__main__':
cookies = {}
headers = {
'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64)'
' AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/68.0.3440.106 Safari/537.36')
}
# temp为复制的 cookies
temp = "DXY_USER_GROUP=49; __auc=f5db5ffc1693f4415a8b9b324af; _ga=GA1.2.406327704.1551544823; _gid=GA1.2.832234072.1551600247; __utma=1.406327704.1551544823.1551575932.1551655682.5; __utmz=1.1551655682.5.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmz=3004402.1551676197.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JUTE_BBS_DATA=59a1c912e729f883a4072343bbd3cf3ce120b0d2a9073d1f98b4b4abb6b976855a9d60443f3910a91a2f4acd5ba5b796cb23f957a410053aaeca64aaa758f7468c5f3b6e4f5e8b3afdde9ab9a36c4e7c599039e6942142f476034f89445921cdfdac46fbcd62e2b2d57ebc2c50c50d8e1b14d314431af16b; __utmc=3004402; JUTE_SESSION_ID=d8ff12d6-4a18-49b7-a793-3cec155e2871; JUTE_TOKEN=364c5b97-0a5e-479b-bcb2-4fa6e665aa55; JSESSIONID=0D8D5058CEC7915AFFE1E95EEB7ECDF1; __utma=3004402.406327704.1551544823.1551693466.1551704245.3; __utmt=1; __utmb=3004402.1.10.1551704245; JUTE_SESSION=e8ecebb9b808ddb678837312dda5b1b477f72176e200de7dd3f4858315fb204c21184bb31cedc24a7c1f7c4dcccee51ab23a4595b8e44787b9fd92479d0a34424ab9ce058850dba8"
for i in temp.split(';'):
li = i.strip().split('=')
cookies[li[0]] = li[1]
url = ("http://3g.dxy.cn/bbs/bbsapi/mobile?"
"s=view_topic&checkUserAction=1&with"
"Good=1&order=0&size=20&id=509959&page={}")
main(url, headers, cookies)

结果

最新文章

  1. scrollTop和scrollLeft的兼容解决万全方法
  2. Multipart to single part feature
  3. htmlnav
  4. python --那些你应该知道的知识点
  5. Android开发之获取状态栏高度、屏幕的宽和高
  6. ECSHOP 数据库结构说明 (适用版本v2.7.3)
  7. SQL 左外连接,右外连接,全连接,内连接
  8. INI文件的读写
  9. 具体的例子来教你怎么做LoadRunner结果分析
  10. Host和Server的开发
  11. nant build
  12. 【3】docker命令集
  13. API做翻页的两种思路
  14. react中的refs
  15. 你真的知道什么是【共享Session】,什么是【单点登录】吗?
  16. 2013传智播客视频--.ppt,.pptx,.doc,.docx.目录
  17. 体验 PHP under .NET Core
  18. 2017-4-28/PHP实现Redis
  19. VMware 14 tools位置;
  20. c# SSH ,SFTP

热门文章

  1. Oracle学习(12):存储过程,函数和触发器
  2. PL/SQL Developer使用技巧、快捷键(转发)
  3. mysql数据库字符编码修改
  4. bzoj2132: 圈地计划(无比强大的最小割)
  5. 洛谷P2593 [ ZJOI 2006 ] 超级麻将 —— DP
  6. C#中动态读取配置
  7. 数据通讯与网络 第五版第24章 传输层协议-TCP协议部分要点
  8. 《CSS Mastery》读书笔记(4)
  9. python--8、socket网络编程
  10. Android Retrofit 2.0文件上传