9.3.4 BeautifulSoup4

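BeautifulSoup is an excellent Python extension library for extracting the data we are interested in from HTML or XML files, and it lets you specify which parser to use.
Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
The session below briefly demonstrates what BeautifulSoup4 can do; for more detailed and complete learning material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.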
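The parser is selected by the second argument passed to BeautifulSoup. The following is a minimal sketch, assuming only that beautifulsoup4 is installed: it uses 'html.parser', which ships with Python, so no extra package is required. The sample markup and the variable names (fragment, repaired) are made up for illustration. Different parsers may repair incomplete markup differently; for example, 'lxml' (a separate install) wraps a bare string in html, body and p tags, as the session below shows, whereas 'html.parser' does not add those wrappers.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with an unclosed <b> tag.
fragment = "<p>hello <b>world</p>"

# 'html.parser' is Python's built-in parser; 'lxml' and 'html5lib' are
# alternatives that must be installed separately.
soup = BeautifulSoup(fragment, "html.parser")

# Serialize the parsed tree back to a string; the dangling <b> tag
# is closed when the enclosing </p> is reached.
repaired = str(soup)
print(repaired)
```

If the parser argument is omitted, BeautifulSoup picks the best parser it can find on the machine and emits a warning, so naming the parser explicitly keeps results consistent across machines.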
 >>> from bs4 import BeautifulSoup

 >>>

 >>> # Missing tags are added and completed automatically

 >>> BeautifulSoup('hello world','lxml')

 <html><body><p>hello world</p></body></html>

 >>>

 >>> # Define some HTML document content

 >>> html_doc = """

 <html><head><title>The Dormouse's story</title></head>

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 >>>

 >>> # Parse this HTML document and display it in a nicely formatted way

 >>> soup = BeautifulSoup(html_doc,'html.parser')

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

     and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>> # Access a specific tag

 >>> soup.title

 <title>The Dormouse's story</title>

 >>>

 >>> # Tag name

 >>> soup.title.name

 'title'

 >>>

 >>> # Tag text

 >>> soup.title.text

 "The Dormouse's story"

 >>>

 >>> # The parent tag of the title tag

 >>> soup.title.parent

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.head

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.b

 <b>The Dormouse's story</b>

 >>>

 >>> soup.b.name

 'b'

 >>> soup.b.text

 "The Dormouse's story"

 >>>

 >>> # The whole BeautifulSoup object can itself be treated as a tag object

 >>> soup.name

 '[document]'

 >>>

 >>> soup.body

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 </body>

 >>>

 >>> soup.p

 <p class="title"><b>The Dormouse's story</b></p>

 >>>

 >>> # Tag attributes

 >>> soup.p['class']

 ['title']

 >>>

 >>> soup.p.get('class')         # attributes can also be read this way

 ['title']

 >>>

 >>> soup.p.text

 "The Dormouse's story"

 >>>

 >>> soup.p.contents

 [<b>The Dormouse's story</b>]

 >>>

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 >>>

 >>> # View all attributes of the a tag

 >>> soup.a.attrs

 {'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}

 >>>

 >>> # Find all a tags

 >>> soup.find_all('a')

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> # Find <a> and <b> tags at the same time

 >>> soup.find_all(['a','b'])

 [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> import re

 >>> # Find tags whose href contains a given keyword

 >>> soup.find_all(href=re.compile("elsie"))

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 >>>

 >>> soup.find(id='link3')

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 >>>

 >>> soup.find_all('a',id='link3')

 [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> for link in soup.find_all('a'):

     print(link.text,':',link.get('href'))

 Elsie : http://example.com/elsie

 Lacie : http://example.com/lacie

 Tillie : http://example.com/tillie

 >>>

 >>> print(soup.get_text())           # return all of the text

 The Dormouse's story

 The Dormouse's story

 Once upon a time there were three little sisters;and their names were

 Elsie,

 Lacieand

 Tillie;

 and they lived at the bottom of a well.

 ...

 >>>

 >>> # Modify a tag attribute

 >>> soup.a['id']='test_link1'

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

 >>>

 >>> # Modify the tag text

 >>> soup.a.string.replace_with('test_Elsie')

 'Elsie'

 >>>

 >>> soup.a.string

 'test_Elsie'

 >>>

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="test_link1">

     test_Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

     and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>> # Iterate over child tags

 >>> for child in soup.body.children:

     print(child)

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 >>>
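
The lookups shown in the session above (find_all with a tag name, filtering on href with a regular expression, and reading the text and attributes of each tag) can be combined into a small reusable helper. This is only a sketch under stated assumptions: the function name extract_links, its pattern parameter, and the shortened sample markup are hypothetical and are not part of BeautifulSoup.

```python
import re

from bs4 import BeautifulSoup


def extract_links(html, pattern=None):
    """Return (text, href) pairs for <a> tags in html.

    extract_links and its pattern argument are illustrative names,
    not part of the bs4 API. If pattern is given, only links whose
    href matches that regular expression are returned.
    """
    soup = BeautifulSoup(html, "html.parser")
    if pattern is None:
        # href=True keeps only <a> tags that actually carry an href.
        anchors = soup.find_all("a", href=True)
    else:
        anchors = soup.find_all("a", href=re.compile(pattern))
    return [(a.get_text(strip=True), a["href"]) for a in anchors]


# Shortened, made-up version of the document used in the session.
sample = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

links = extract_links(sample)
for text, href in links:
    print(text, ":", href)
```

Returning plain tuples rather than Tag objects keeps the result independent of the parsed tree, which is convenient when the links are to be stored or compared later.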