A First Look at Python Web Crawlers
2024-10-21 14:29:12
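This article summarizes the "Python Web Crawler in Practice" (Python网络爬虫实战) course on NetEase Cloud Classroom (网易云课堂). Readers who are interested can watch the video course. Course link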
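Introduction to Crawlers
A crawler is a program that automatically fetches information from the internet.
Unstructured Data
Unstructured data has no fixed format; web page content is a typical example.
It has to be converted into structured data with ETL (Extract, Transform, Load) tools before it can be used.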
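To make the ETL idea concrete, here is a minimal sketch of the three steps using BeautifulSoup and the standard csv module. The HTML snippet, the field names, and the helper names transform and load are all made up for illustration; a real page would need its own selectors.

```python
import csv
import io
from bs4 import BeautifulSoup

# Extract: raw, unstructured HTML (a made-up snippet standing in for a downloaded page)
raw_html = """
<ul>
  <li><span class="date">2018-06-02</span> <a href="/news/1.shtml">First headline</a></li>
  <li><span class="date">2018-06-01</span> <a href="/news/2.shtml">Second headline</a></li>
</ul>
"""

def transform(html):
    """Transform: turn each <li> into a dict with fixed fields."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for li in soup.select('li'):
        link = li.select('a')[0]
        rows.append({
            'date': li.select('.date')[0].text.strip(),
            'title': link.text.strip(),
            'href': link['href'],
        })
    return rows

def load(rows):
    """Load: write the structured rows as CSV text (a file or database would also work)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['date', 'title', 'href'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = transform(raw_html)
print(load(rows))
```

After the transform step every record has the same three fields, which is what makes the data structured.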
Installing the Tools
Anaconda
pip install requests
pip install beautifulsoup4
pip install jupyter
Start Jupyter
jupyter notebook
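requests: a library for fetching network resources
Fetching a page
import requests
url = ''  # put the address of the page to fetch here
res = requests.get(url)
res.encoding = 'utf-8'  # decode the response body as UTF-8
print(res.text)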
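The fetch above can be wrapped in a small helper that also sets a timeout and fails loudly on HTTP errors. This is only a sketch: the function name fetch_html and its get parameter are hypothetical, while timeout= and raise_for_status() are regular parts of requests.

```python
import requests

def fetch_html(url, encoding='utf-8', timeout=10, get=requests.get):
    """Fetch a page and return its decoded text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the helper
    can be tried out without network access.
    """
    res = get(url, timeout=timeout)   # give up if the server does not answer in time
    res.raise_for_status()            # raises requests.HTTPError for 4xx/5xx responses
    res.encoding = encoding           # tell requests how to decode the body
    return res.text

# Usage (needs network access):
# html = fetch_html('http://sx.sina.com.cn/')
```

Without a timeout, requests.get can wait indefinitely on an unresponsive server, which matters once a crawler visits many pages.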
Loading the page into BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
print(soup.text)
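Use the select method to find HTML elements by tag name, id, or class. The return value is always a list.
By tag name: select('h1'), select('a')
By id: for an element with id = 'thehead', use select('#thehead')
By class: prefix the class name with a dot, for example select('.fl')
alink = soup.select('a')
for link in alink:
    print(link['href'])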
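The three kinds of selector can be tried on a small inline document without any network access. The HTML below is invented for the demonstration.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="thehead">Page title</h1>
  <a class="link" href="/first">First</a>
  <a class="link" href="/second">Second</a>
  <a name="anchor-without-href">No link here</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

by_tag = soup.select('h1')        # tag name
by_id = soup.select('#thehead')   # id, prefixed with '#'
by_class = soup.select('.link')   # class, prefixed with '.'

print(by_tag[0].text)
print(by_id[0].text)
print([a.text for a in by_class])

# link['href'] raises KeyError when the attribute is missing;
# .get('href') returns None instead, so such anchors can be skipped
hrefs = [a.get('href') for a in soup.select('a') if a.get('href')]
print(hrefs)
```

Because select always returns a list, a selector that matches nothing gives an empty list, and indexing it with [0] raises IndexError.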
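Examples
1. Get the time, title, and link of each news item on Sina Shaanxi
import requests
from bs4 import BeautifulSoup

res = requests.get('http://sx.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Each .news-list.cur block holds the <li> entries of one news list
for newslist in soup.select('.news-list.cur'):
    # Select the <li> elements directly; iterating over a tag's children
    # also yields plain text nodes, which have no select method
    for li in newslist.select('li'):
        title = li.select('h2')[0].text
        href = li.select('a')[0]['href']
        time = li.select('.fl')[0].text
        print(time, title, href)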
2. Get an article's title, source, time, and body
import requests
from bs4 import BeautifulSoup
from datetime import datetime

res = requests.get('http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew5095240.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

h1 = soup.select('h1')[0].text
source = soup.select('.source-time span span')[0].text
# The first child of .source-time holds the publication time
timesource = soup.select('.source-time')[0].contents[0].text
date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')

# Collect every paragraph of the body except the last one
article = []
for p in soup.select('.article-body p')[:-1]:
    article.append(p.text.strip())
' '.join(article)
The loop can be shortened to:
' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
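Notes:
The datetime module handles dates and times; strptime parses a string into a datetime object according to a format.
[:-1] drops the last element of a list.
strip() removes the given characters from both ends of a string (whitespace such as spaces and newlines by default).
' '.join(article) joins the list items into one string, separated by spaces.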
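Each of these pieces can be checked on its own with made-up values. The timestamp and paragraph strings below are invented; the last list item stands in for a trailing line that the article body does not need.

```python
from datetime import datetime

# strptime: parse text into a datetime using a format string
timesource = '2018-06-02 09:15'
date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
print(date.year, date.month, date.day, date.hour, date.minute)

# strftime goes the other way: datetime back to text
print(date.strftime('%Y/%m/%d'))

# [:-1] drops the last item, strip() trims whitespace, ' '.join glues the rest
paragraphs = ['  First paragraph.\n', '\tSecond paragraph. ', 'Editor: someone']
article = ' '.join(p.strip() for p in paragraphs[:-1])
print(article)
```

strptime raises ValueError when the text does not match the format exactly, so stray whitespace around a scraped timestamp has to be removed first.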
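3. Get the article's comment count. The count is written into the page by JavaScript, so it cannot be retrieved with the method above. Look under the JS requests the page makes (for example in the browser's developer tools) and find the one that returns the article's comments.
import requests
import json

comments = requests.get('http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-hcikcew5095240:0')
# The response is a JavaScript assignment ("var data = ..."); keep only the JSON after the first '='.
# str.strip('var data =') would treat its argument as a set of characters, not as a prefix.
jd = json.loads(comments.text.partition('=')[2])
jd['result']['count']['sx:comos-hcikcew5095240:0']['total']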
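When JSON arrives wrapped in a JavaScript assignment, it helps to know that str.strip(chars) removes any of the listed characters from both ends rather than a literal prefix. The sketch below shows the difference and one way to cut at the first '=' instead. The response body is made up, the real payload may look different, and parse_js_assignment is a hypothetical helper name.

```python
import json

# strip() treats its argument as a set of characters, so it can eat into the value
print('var data = total'.strip('var data ='))   # prints "otal", not "total"

# Made-up response body in the "var data = {...}" shape
body = 'var data={"result": {"count": {"sx:comos-abc:0": {"total": 12}}}}'

def parse_js_assignment(text):
    """Return the JSON value from text shaped like 'var name = {...}'."""
    _, _, payload = text.partition('=')   # everything after the first '='
    # Dropping a trailing ';' is a guess at how such responses may end
    return json.loads(payload.strip().rstrip(';'))

jd = parse_js_assignment(body)
print(jd['result']['count']['sx:comos-abc:0']['total'])
```

With a JSON object as the payload, strip happens to work because '{' and '}' are not in the character set, but cutting at the first '=' does not depend on that coincidence.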
4. Wrap the comment-count lookup in a function
import re
import json
import requests

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    # The news id is the part of the article URL between "detail-i" and ".shtml"
    m = re.search(r'detail-i(.+)\.shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    # Keep only the JSON after the first '=' of the "var data = ..." response
    jd = json.loads(comments.text.partition('=')[2])
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

news = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'
getCommentCounts(news)
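The id extraction can be tested separately from the network request. In this sketch the helper name extract_news_id is hypothetical; the pattern escapes the dot so that it matches a literal ".shtml", and the function returns None instead of failing when the URL is not an article page.

```python
import re

# Capture whatever sits between "detail-i" and ".shtml"
NEWS_ID_PATTERN = re.compile(r'detail-i(.+)\.shtml')

def extract_news_id(url):
    """Return the news id from an article URL, or None if the URL has none."""
    m = NEWS_ID_PATTERN.search(url)
    return m.group(1) if m else None

print(extract_news_id('http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'))
print(extract_news_id('http://sx.sina.com.cn/'))
```

re.search returns None when nothing matches, so calling m.group(1) without the check would raise AttributeError on a non-article URL.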
5. A function that takes an article URL and returns all of the article's information (title, time, source, body, and so on); this is the complete version
import requests
import json
import re
from bs4 import BeautifulSoup
from datetime import datetime

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    # The news id is the part of the article URL between "detail-i" and ".shtml"
    m = re.search(r'detail-i(.+)\.shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    # Keep only the JSON after the first '=' of the "var data = ..." response
    jd = json.loads(comments.text.partition('=')[2])
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('h1')[0].text
    result['newssource'] = soup.select('.source-time span span')[0].text
    timesource = soup.select('.source-time')[0].contents[0].text
    result['date'] = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
    result['comments'] = getCommentCounts(newsurl)
    return result

news = 'http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew8995238.shtml'
getNewsDetail(news)
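The dict returned by getNewsDetail holds a datetime object, which the json module cannot serialize on its own. As a possible last step (the "Load" part of ETL), the sketch below stores such a record as JSON text. The record is made up to mirror the dict's shape, and the helper names encode_extra and to_json are hypothetical.

```python
import json
from datetime import datetime

# Made-up record in the same shape as the dict built by getNewsDetail
result = {
    'title': 'Sample headline',
    'newssource': 'Sample source',
    'date': datetime(2018, 6, 2, 9, 15),
    'article': 'First paragraph. Second paragraph.',
    'comments': 12,
}

def encode_extra(value):
    """Called by json.dumps for values it cannot serialize itself."""
    if isinstance(value, datetime):
        return value.isoformat()   # e.g. 2018-06-02T09:15:00
    raise TypeError('cannot serialize %r' % type(value))

def to_json(record):
    # ensure_ascii=False keeps Chinese titles readable in the output
    return json.dumps(record, ensure_ascii=False, default=encode_extra)

text = to_json(result)
print(text)
```

The resulting string can be written to a file or sent to a database, one record per article.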