py3+requests+re+urllib: scraping and downloading Budejie videos
2024-10-10 14:04:46
For the underlying approach and reasoning, see my other scraping write-ups:
py3+urllib+bs4+anti-scraping measures: scrape Douban photo galleries in 20+ lines of code: http://www.cnblogs.com/UncleYong/p/6892688.html
py3+requests+json+xlwt: scraping Lagou job postings: http://www.cnblogs.com/UncleYong/p/6960044.html
py3+urllib+re: easily scrape the winning numbers of the last 100 Double Color Ball draws: http://www.cnblogs.com/UncleYong/p/6958242.html
The implementation code is as follows:
import os
import re
import urllib.request

import requests

url_name = []

def get():
    hd = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    url = 'http://www.budejie.com/video/'
    html = requests.get(url, headers=hd).text
    # print(html)
    url_content = re.compile(r'(<div class="j-r-list-c">.*?</div>.*?</div>)', re.S)
    url_contents = re.findall(url_content, html)
    # print(url_contents)
    for i in url_contents:  # HTML of each list container
        url_reg = r'data-mp4="(.*?)"'
        url_item = re.findall(url_reg, i)
        # print(type(url_item))  # <class 'list'>
        # print(url_item)
        if url_item:
            name_reg = re.compile(r'<a href="/detail-.{8}?.html">(.*?)</a>', re.S)  # .{8} matches the 8-character detail id
            name_item = re.findall(name_reg, i)  # findall returns a list
            # print(type(name_item))  # <class 'list'>
            # print(name_item)
            for i, k in zip(name_item, url_item):
                url_name.append([i, k])  # append a list; a tuple works too: url_name.append((i, k))
                # print(url_name)
                # print(i, k)
    os.makedirs('video', exist_ok=True)  # urlretrieve fails if the target directory does not exist
    for i in url_name:
        print('Downloading >>>>> ' + i[0] + ':' + i[1])
        # in each element, i[0] is the title and i[1] is the video URL
        urllib.request.urlretrieve(i[1], 'video/%s.mp4' % (i[0]))  # on Windows: 'video\\%s.mp4'

if __name__ == '__main__':
    get()
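One pitfall worth noting: the titles scraped from the page can contain characters that are illegal in filenames (`/`, `:`, `?`, and so on), which would make `urlretrieve` fail. A minimal sketch of a sanitizing helper you could apply to `i[0]` before building the path; `safe_filename` is a hypothetical name, not part of the original script:

```python
import re

def safe_filename(title):
    # Replace characters that are illegal in Windows/Unix filenames with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('funny/clip: part 2?'))
```

You would then write `urllib.request.urlretrieve(i[1], 'video/%s.mp4' % safe_filename(i[0]))` in the download loop.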