A Quick Guide to BeautifulSoup, a Data Parsing Module

I. Setting up the environment:

1. Prepare the test page test.html

<html>

<head>

    <title>

        The Dormouse's story

    </title>

</head>

<body>

<p class="title">

    <b>

        The Dormouse's story

    </b>

</p>

<p class="story">

    Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

        Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

        Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

        Tillie

    </a>

    ; and they lived at the bottom of a well.

</p>

<p class="story">

    ...

</p>

</body>

</html>


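2. Install the required modules

pip install beautifulsoup4    # the official package name; it provides the bs4 module
pip install lxml              # the examples below use the 'lxml' parser, which is a separate package
pip install requests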
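If lxml turns out not to be installed, BeautifulSoup raises FeatureNotFound when asked for the 'lxml' parser. The sketch below is one possible way to check the installation and fall back to Python's built-in 'html.parser'. The helper name make_soup is hypothetical and is not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed, so use the parser shipped with Python.
        return BeautifulSoup(markup, 'html.parser')


check = make_soup('<p>ok</p>')
print(check.p.string)
```

If this prints the paragraph text without an error, bs4 is importable and at least one parser is available.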

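II. BeautifulSoup syntax:

1. Instantiating a BeautifulSoup object

import requests
from bs4 import BeautifulSoup

# Create a BeautifulSoup object

# 1. From a local HTML file.
# General form: BeautifulSoup(open('path/to/local/file'), 'lxml')
# For example, with the test.html prepared above:
with open('test.html', mode='r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
print(soup.a)   # print the first <a> tag, including its attributes and contents

# 2. From HTML obtained through requests.get or some other means.
# General form: BeautifulSoup(markup, 'lxml'), where markup is a str or bytes value
# For example, with page data fetched by requests:
page_html = requests.get(url='http://www.baidu.com').text
soup = BeautifulSoup(page_html, 'lxml')
print(soup.a)   # print the first <a> tag, including its attributes and contents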
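The constructor accepts both str and bytes, which corresponds to response.text and response.content in requests. The following is a minimal sketch that avoids the network by using a hypothetical inline snippet, and that uses the built-in 'html.parser' so it runs without lxml. Swapping in 'lxml' should behave the same way for this markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html_text = '<html><body><a href="http://example.com/elsie" id="link1">Elsie</a></body></html>'
html_bytes = html_text.encode('utf-8')

# str input is parsed as-is; bytes input is decoded by bs4 before parsing.
soup_from_text = BeautifulSoup(html_text, 'html.parser')
soup_from_bytes = BeautifulSoup(html_bytes, 'html.parser')

print(soup_from_text.a)
print(soup_from_bytes.a)
```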

2. Getting tags, tag contents, and tag attributes from the object (using the test.html prepared above as the example).

from bs4 import BeautifulSoup

with open('test.html', mode='r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

print(soup.title)             # print the whole <title> tag
print(soup.a)                 # print the whole first <a> tag
print(soup.a.attrs)           # print all attributes of the <a> tag as a dict
print(soup.a.attrs['href'])   # print the value of the href attribute
print(soup.a['href'])         # shorthand for the line above
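Attribute access has a few details worth seeing side by side. This is a small sketch on a hypothetical one-link snippet modelled on test.html, with a second class name added for illustration and 'html.parser' used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link snippet; the extra class "big" is added for illustration.
snippet = BeautifulSoup(
    '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>',
    'html.parser',
)
link = snippet.a

href = link['href']            # raises KeyError if the attribute is absent
title = link.get('title')      # returns None instead of raising
classes = link['class']        # class is multi-valued, so this is a list
has_id = link.has_attr('id')   # True when the attribute exists

print(href, title, classes, has_id)
```

Using get() is the safer choice when an attribute may be missing on some tags.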

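# Print the text inside the <a> tag
print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())

# Note: if the <a> tag has more than one child (for example, text plus a nested tag), soup.a.string returns None,
# whereas soup.a.text and soup.a.get_text() return the combined text of the <a> tag and all of its descendants.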
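The difference between string and get_text() is easiest to see on a tag that mixes text with a nested tag. The markup below is a hypothetical example, not taken from test.html, and 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link contains text plus a nested <b> tag.
demo = BeautifulSoup(
    '<a id="plain">Elsie</a>'
    '<a id="nested">\n    Lacie\n    <b>the second</b>\n</a>',
    'html.parser',
)
plain = demo.find('a', id='plain')
nested = demo.find('a', id='nested')

plain_string = plain.string      # one text child, so the text is returned
nested_string = nested.string    # several children, so this is None
nested_text = nested.get_text()  # raw text of all descendants, whitespace kept
# Strip each piece of text and join the non-empty pieces with a space.
nested_clean = nested.get_text(' ', strip=True)

print(repr(plain_string), repr(nested_string), repr(nested_clean))
```

Passing a separator together with strip=True is a convenient way to clean up pages like test.html, where every tag is surrounded by newlines and indentation.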

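# Note: soup.tagName stops at the first tagName tag it finds

soup.find('a')   # Like soup.tagName, returns only the first match; unlike it, find() can combine a tag name with attributes.
print(soup.find('a', {'class': 'sister', 'id': 'link2'}))  # locate by tag name and attributes
soup.find_all('a', {'class': 'sister'})   # same arguments as find(), but returns a list of every match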
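Since find_all() is only mentioned in passing, here is a short sketch of how it might be used. The markup is a cut-down, hypothetical copy of the story paragraph with ids link1 to link3 assumed, and 'html.parser' is used so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical copy of the story paragraph from test.html.
story = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    '</p>',
    'html.parser',
)

all_links = story.find_all('a', {'class': 'sister'})       # list-like result
hrefs = [link['href'] for link in all_links]
first_two = story.find_all('a', class_='sister', limit=2)  # stop after 2 matches
no_matches = story.find_all('a', id='missing')             # empty list
no_match = story.find('a', id='missing')                   # None

print(hrefs)
```

Note the different "nothing found" results: find_all() gives an empty list, while find() gives None, so a find() result should be checked before its attributes are accessed.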

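# Locating with CSS selectors
# Common selectors: tag selector (a), class selector (.), id selector (#), and hierarchy selectors

soup.select('a')              # every <a> tag
print(soup.select('.sister')) # by class name sister
print(soup.select('#link1'))  # by id
print(soup.select('p>a'))     # every <a> that is a direct child of a <p>; use 'p a' to match at any depth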
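The two hierarchy selectors behave differently once tags are nested more deeply. The sketch below uses hypothetical markup in which one link is wrapped in a span, and 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link sits inside a <span>, not directly in <p>.
page = BeautifulSoup(
    '<p class="story">'
    '<a class="sister" id="link1">Elsie</a>'
    '<span><a class="sister" id="link2">Lacie</a></span>'
    '</p>',
    'html.parser',
)

direct_children = page.select('p > a')     # only <a> tags directly inside <p>
descendants = page.select('p a')           # <a> tags at any depth inside <p>
first_sister = page.select_one('.sister')  # first match, or None if no match
by_id = page.select('#link2')              # select() always returns a list

print([a['id'] for a in direct_children])
print([a['id'] for a in descendants])
```

select_one() is the selector counterpart of find(): it returns a single tag rather than a list.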
