Scraping Bilibili and saving the results to a CSV file (data provided)
2024-08-31 20:59:08
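"""
Bilibili ranking crawler (requests + lxml)
https://www.bilibili.com/ranking#!/all/0/0/7/
Scrapes rank number, title, URL, overall score, play count, and comment count
Saves the results to a CSV file
"""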
import requests
from fake_useragent import FakeUserAgent
from lxml import etree
import re
import csv
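url = 'https://www.bilibili.com/ranking#!/all/0/0/7/'
# Proxy IP. Note: only an "http" entry is defined, so requests will not
# route this https URL through the proxy unless an "https" entry is added.
proxies = {"http": "101.65.24.108:8118"}
headers = {
    'User-Agent': FakeUserAgent().random
}
# Pass the proxy dict via proxies=, not params= (params= would append it to the query string)
html = requests.get(url, proxies=proxies, headers=headers).text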
# print(html)
# Build the element tree
html1 = etree.HTML(html)
"""
Fields to scrape: rank number, title, URL, overall score, play count, comment count
Rank number markup:
<div class="num">1</div>
<div class="num">2</div>
Title markup:
<a href="//www.bilibili.com/video/av55443085/" target="_blank" class="title">【党妹】三十变十三!毕业季必须拥有的芒果系JK妆容,成为甜甜山吹女孩!</a>
<a href="//www.bilibili.com/video/av55210171/" target="_blank" class="title">【中字.迪士尼反派系列2】后妈们的抱怨</a>
Score markup:
<div class="">2087768</div>
<div class="">1715927</div>
"""
bianhao = html1.xpath('//div[@class="num"]/text()')
print(bianhao)
titles = html1.xpath('//a[@class="title"]/text()')
print(titles)
urls = html1.xpath('//a[@class="title"]/@href')
# print(urls)
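# Clean up the URLs: the hrefs are protocol-relative ("//www.bilibili.com/video/..."),
# so prepend the scheme instead of stripping every slash, which would mangle the path.
url_list = []
for href in urls:
    url_list.append("https:" + href)
print(url_list)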
grade = html1.xpath('//div[@class="pts"]/div/text()')
print(grade)
# Play count
vv = html1.xpath('//div[@class="detail"]/span[1]/text()')
print(vv)
# Comment count
comment = html1.xpath('//div[@class="detail"]/span[2]/text()')
print(comment)
# Process the data before saving it to a CSV file
# Use zip so the values in each column line up row by row
data_list = []
res = zip(bianhao, titles, url_list, grade, vv, comment)
for data in res:
    data_list.append(data)
print(data_list)
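# Open a CSV file (newline='' keeps the csv module from writing blank lines on Windows)
with open('../files/data/bzhan.csv', 'w', encoding='utf-8', newline='') as file:
    csv_f = csv.writer(file)
    # Write the header row
    csv_f.writerow(["id", "title", "url", "grade", "vv", "comment"])
    for row in data_list:
        csv_f.writerow(row)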
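
The script relies on zip to line up the scraped columns, and zip silently truncates to the shortest list, so a single XPath that misses one element would shift or drop rows without any error. The following is a minimal sketch of a stricter version of that step using only the standard library. The helper names (normalize_url, build_rows, rows_to_csv) and the BASE_URL constant are hypothetical and not part of the original script, and the sample values are taken from the markup quoted above.

```python
import csv
import io
from urllib.parse import urljoin

# Assumption: hrefs are resolved against the site root used by the ranking page.
BASE_URL = "https://www.bilibili.com/"


def normalize_url(href):
    """Turn a protocol-relative href (//host/path/) into an absolute URL."""
    return urljoin(BASE_URL, href)


def build_rows(*columns):
    """Zip parallel columns into rows, raising instead of silently truncating."""
    lengths = [len(col) for col in columns]
    if len(set(lengths)) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return list(zip(*columns))


def rows_to_csv(rows, header):
    """Serialize a header and rows to CSV text held in memory."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Sample values copied from the markup analysed in the script.
    ids = ["1", "2"]
    hrefs = [
        "//www.bilibili.com/video/av55443085/",
        "//www.bilibili.com/video/av55210171/",
    ]
    scores = ["2087768", "1715927"]
    rows = build_rows(ids, [normalize_url(h) for h in hrefs], scores)
    print(rows_to_csv(rows, ["id", "url", "grade"]))
```

Raising on a length mismatch is one possible design choice; itertools.zip_longest with a fill value would be an alternative if partial rows are acceptable.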