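A First Look at Python Web Crawlers

This article summarizes the course "Python网络爬虫实战" (Python Web Crawler in Practice) on NetEase Cloud Classroom (网易云课堂). Interested readers can watch the video course. (Course link)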

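Introduction to crawlers

A crawler is a program that automatically fetches information from the internet.

Unstructured data

Unstructured data has no fixed format; web pages are a typical example.

It must be converted into structured data with an ETL (Extract, Transform, Load) tool before it can be used.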
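As a rough illustration of the extract, transform, load idea, the sketch below uses only the Python standard library. The sample text, the field names, and the helpers `extract_records` and `load_csv` are made up for this illustration and do not come from the course.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical unstructured input: free text with no fixed schema.
RAW_TEXT = """
06-02 09:30 City opens new metro line
06-01 18:05 Heavy rain expected this weekend
"""

LINE_PATTERN = re.compile(r"(\d{2}-\d{2} \d{2}:\d{2})\s+(.+)")


def extract_records(raw, year=2018):
    """Extract: find matching lines. Transform: parse each into typed fields."""
    records = []
    for match in LINE_PATTERN.finditer(raw):
        stamp, title = match.groups()
        when = datetime.strptime(f"{year}-{stamp}", "%Y-%m-%d %H:%M")
        records.append({"time": when.isoformat(), "title": title.strip()})
    return records


def load_csv(records):
    """Load: write the structured rows to CSV (an in-memory buffer here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["time", "title"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = extract_records(RAW_TEXT)
print(load_csv(records))
```

Once the rows are dictionaries with known keys, they can be written to a file or a database in the same way.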

Tool installation

Anaconda

pip install requests

pip install beautifulsoup4

pip install jupyter

Start Jupyter

jupyter notebook

requests: a library for fetching network resources

Fetching a page

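import requests

# Address of the page to fetch; fill this in before running
url = ''
res = requests.get(url)
# Set the encoding explicitly so that Chinese text decodes correctly
res.encoding = 'utf-8'
print(res.text)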
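The effect of setting res.encoding can be seen without any network access. When a server sends a text response without declaring a charset, requests falls back to ISO-8859-1 when building res.text, which garbles Chinese text. A minimal sketch with made-up sample bytes:

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text (sample data).
body = "新闻标题".encode("utf-8")

# Decoding with the wrong charset does not raise; it silently yields mojibake.
garbled = body.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
correct = body.decode("utf-8")

print(repr(garbled))
print(correct)
```

If the page is not actually UTF-8 (some older Chinese sites use GBK), the encoding set on the response needs to match the page instead.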

Loading the page into BeautifulSoup

from bs4 import BeautifulSoup

# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(res.text, 'html.parser')
# soup.text is the page text with all tags removed
print(soup.text)

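Use the select method to find the HTML elements that match a selector. The selector can be a tag name, an id, or a class, and the return value is a list.

By tag name: select('h1'), select('a')

By id: for an element with id = 'thehead', use select('#thehead')

By class: prefix the class name with a dot, for example select('.news-list')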

alink = soup.select('a')
for link in alink:
    # get() returns None instead of raising KeyError when an <a> tag has no href
    print(link.get('href'))
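The three selector forms can be tried offline on a small HTML string. The markup below is invented for this illustration; only the BeautifulSoup calls themselves (select, text, get) are the real API.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<h1 id="thehead">Headline</h1>
<a class="link" href="/first">First</a>
<a class="link" href="/second">Second</a>
<a name="anchor-without-href">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("h1")        # tag name
by_id = soup.select("#thehead")   # id, prefixed with '#'
by_class = soup.select(".link")   # class, prefixed with '.'

# select() always returns a list, even for a single match.
print(by_tag[0].text, by_id[0].text, len(by_class))

# Tag.get() returns None instead of raising KeyError when the attribute is missing.
hrefs = [a.get("href") for a in soup.select("a")]
print(hrefs)
```

Because the result is always a list, a selector that matches nothing gives an empty list, and indexing it with [0] raises IndexError.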

Examples

1. Get the time, title, and link of each news item on Sina Shaanxi

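import requests
from bs4 import BeautifulSoup

res = requests.get('http://sx.sina.com.cn/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Each matching container holds the <li> news items
for newslist in soup.select('.news-list.cur'):
    # select('li') searches all descendants, so no extra loop over children is needed
    # (iterating over a tag directly also yields text nodes, which cannot be searched)
    for li in newslist.select('li'):
        title = li.select('h2')[0].text
        href = li.select('a')[0]['href']
        time = li.select('.fl')[0].text
        print(time, title, href)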

2. Get an article's title, source, time, and body

import requests
from bs4 import BeautifulSoup
from datetime import datetime

res = requests.get('http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew5095240.shtml')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')

# Title
h1 = soup.select('h1')[0].text
# Source: a span nested inside another span under .source-time
source = soup.select('.source-time span span')[0].text
# Time: text of the first child of .source-time, stripped so that strptime can match it
timesource = soup.select('.source-time')[0].contents[0].text.strip()
date = datetime.strptime(timesource, '%Y-%m-%d %H:%M')

# Body: every paragraph except the last one
article = []
for p in soup.select('.article-body p')[:-1]:
    article.append(p.text.strip())
' '.join(article)

Shortened to:

' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])

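Notes:

The datetime module is used here to parse the time string into a datetime object.

[:-1] drops the last element.

strip() removes the given characters from both ends of a string (whitespace, including spaces and newlines, by default).

' '.join(article) joins the list items with a space.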
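These four points can be checked in isolation. The sample strings below stand in for scraped text and are made up for this illustration.

```python
from datetime import datetime

# Sample values standing in for text scraped from a page.
timesource = "  2018-06-02 09:30\n"
paragraphs = ["  First paragraph. ", "\nSecond paragraph.\n", "Editor: someone"]

# strptime needs the string to match the format exactly, so strip whitespace first.
date = datetime.strptime(timesource.strip(), "%Y-%m-%d %H:%M")

# [:-1] drops the last item (here, the editor line); strip() trims each paragraph.
article = " ".join(p.strip() for p in paragraphs[:-1])

# With an argument, strip() removes any of those characters from both ends,
# not a prefix or suffix as a whole.
trimmed = "xxhixyx".strip("xy")

print(date, article, trimmed)
```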

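3. Get the article's comment count. The count is written into the page by JavaScript, so it cannot be obtained with the method above. Look through the JS requests the page makes (for example, under the JS filter in the browser's developer tools) and find the one that returns the article's comments.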

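import requests
import json

comments = requests.get('http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-hcikcew5095240:0')
# The response is a JavaScript assignment ("var data=..."), not bare JSON.
# str.strip would remove a set of characters rather than the prefix, so split on the first '=' instead.
jd = json.loads(comments.text.partition('=')[2].strip().rstrip(';'))
jd['result']['count']['sx:comos-hcikcew5095240:0']['total']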
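The parsing step can be tried offline. Note that str.strip with an argument removes a set of characters rather than a literal prefix, so splitting on the first '=' is a sturdier way to peel off a "var data=" wrapper. In this sketch the response body, the count of 12, and the helper name `parse_js_assignment` are all hypothetical; only the key layout follows the lookup used in this example.

```python
import json

# Hypothetical response body, shaped like the structure this example indexes into.
sample_text = 'var data={"result": {"count": {"sx:comos-hcikcew5095240:0": {"total": 12}}}};'


def parse_js_assignment(text):
    """Return the JSON value from a 'var name=<json>;' style payload."""
    # Keep only what follows the first '=', then drop whitespace and a trailing ';'.
    _, _, payload = text.partition("=")
    return json.loads(payload.strip().rstrip(";"))


jd = parse_js_assignment(sample_text)
total = jd["result"]["count"]["sx:comos-hcikcew5095240:0"]["total"]
print(total)
```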

4. Wrap the comment-count logic in a function

import re
import json
import requests

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    # The news id is the part between 'detail-i' and '.shtml' in the article URL
    m = re.search(r'detail-i(.+)\.shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    # Drop the 'var data=' wrapper before parsing the JSON
    jd = json.loads(comments.text.partition('=')[2].strip().rstrip(';'))
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

news = 'http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml'
getCommentCounts(news)
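The id extraction is worth testing on its own, since re.search returns None when a URL does not match and calling group on None raises AttributeError. A minimal sketch, where the helper name `extract_news_id` and the stricter pattern are assumptions for this illustration:

```python
import re

# Escape the dot so it matches a literal '.', and stop the id at the first dot or slash.
NEWS_ID_PATTERN = re.compile(r"detail-i([^./]+)\.shtml")


def extract_news_id(url):
    """Return the id between 'detail-i' and '.shtml', or None if the URL has no such part."""
    match = NEWS_ID_PATTERN.search(url)
    return match.group(1) if match else None


print(extract_news_id("http://sx.sina.com.cn/news/b/2018-06-01/detail-ihcikcev8756673.shtml"))
print(extract_news_id("http://sx.sina.com.cn/"))
```

Returning None lets the caller decide whether to skip the URL or report it.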

5. A function that takes an article address and returns all of its information (title, time, source, body, and so on) (full version)

import requests
import json
import re
from bs4 import BeautifulSoup
from datetime import datetime

commenturl = 'http://comment5.news.sina.com.cn/cmnt/count?format=js&newslist=sx:comos-{}:0'

def getCommentCounts(url):
    # The news id is the part between 'detail-i' and '.shtml' in the article URL
    m = re.search(r'detail-i(.+)\.shtml', url)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    # Drop the 'var data=' wrapper before parsing the JSON
    jd = json.loads(comments.text.partition('=')[2].strip().rstrip(';'))
    return jd['result']['count']['sx:comos-' + newsid + ':0']['total']

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('h1')[0].text
    result['newssource'] = soup.select('.source-time span span')[0].text
    # Strip surrounding whitespace so that strptime can match the format
    timesource = soup.select('.source-time')[0].contents[0].text.strip()
    result['date'] = datetime.strptime(timesource, '%Y-%m-%d %H:%M')
    # Join every paragraph except the last one
    result['article'] = ' '.join([p.text.strip() for p in soup.select('.article-body p')[:-1]])
    result['comments'] = getCommentCounts(newsurl)
    return result

news = 'http://sx.sina.com.cn/news/b/2018-06-02/detail-ihcikcew8995238.shtml'
getNewsDetail(news)
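getNewsDetail handles one address at a time. To run it over several addresses, one option is a small driver that pauses between requests and keeps going when a page fails. This is only a sketch: `crawl_many`, `fake_fetch`, and the delay value are hypothetical, and `fake_fetch` stands in for getNewsDetail so the code runs without network access.

```python
import time


def crawl_many(urls, fetch_detail, delay_seconds=1.0):
    """Call fetch_detail on each URL, collecting results and recording failures."""
    results, failures = [], []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite to the server
        try:
            results.append(fetch_detail(url))
        except Exception as exc:  # one bad page should not stop the whole run
            failures.append((url, repr(exc)))
    return results, failures


# Stand-in for getNewsDetail so the sketch runs offline.
def fake_fetch(url):
    if "missing" in url:
        raise IndexError("list index out of range")
    return {"title": "stub", "url": url}


results, failures = crawl_many(
    ["http://example.com/a", "http://example.com/missing"],
    fake_fetch,
    delay_seconds=0,
)
print(len(results), len(failures))
```

In real use, the call would look like crawl_many(list_of_article_urls, getNewsDetail).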
