Python Crawler Beginner Tutorial: Scraping Shijiazhuang Lianjia Rental Data

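1. Preface

This post scrapes rental listings from Lianjia; the collected data can serve as material for data analysis in later posts.
The URL we need to crawl is: https://sjz.lianjia.com/zufang/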

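2. Analyzing the URL

First, decide which data we need.

As you can see, the fields in the yellow boxes are the data we need.

Next, work out the pagination pattern: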

https://sjz.lianjia.com/zufang/pg1/

https://sjz.lianjia.com/zufang/pg2/

https://sjz.lianjia.com/zufang/pg3/

https://sjz.lianjia.com/zufang/pg4/

https://sjz.lianjia.com/zufang/pg5/

...

https://sjz.lianjia.com/zufang/pg80/
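The pattern above is just a page number substituted into a fixed path, so the full list of URLs can be generated up front. A minimal sketch, assuming 80 pages as in the list above (the function name build_page_urls and its parameters are made up for illustration):

```python
BASE_URL = "https://sjz.lianjia.com/zufang/pg{}/"


def build_page_urls(first_page=1, last_page=80):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))   # number of pages generated
print(urls[0])     # first listing page
print(urls[-1])    # last listing page
```

The real number of pages changes as listings come and go, so last_page is worth checking against the site before each run.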

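3. Parsing the Page

With the pagination pattern known, the page URLs can be assembled quickly. We use the lxml module to parse the page source and extract the data we want.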
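To make the "find a container, then read fields relative to it" idea concrete, here is a small sketch. It deliberately swaps in the standard library's xml.etree.ElementTree for lxml so it runs without extra installs; ElementTree only handles well-formed markup and a subset of XPath, which is why the article itself uses lxml on real pages. The sample markup and the function name extract_listings are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; real listing pages are messier.
SAMPLE = """
<ul>
  <li><div class="info-panel"><span class="region">Sample Estate</span><div class="price"><span>1500</span></div></div></li>
  <li><div class="info-panel"><span class="region">Another Estate</span><div class="price"><span>2300</span></div></div></li>
</ul>
"""


def _text(node):
    """Return the stripped text of a node, or an empty string if it is missing."""
    return node.text.strip() if node is not None and node.text else ""


def extract_listings(markup):
    """Find every info-panel block and read fields relative to that block."""
    root = ET.fromstring(markup.strip())
    rows = []
    for panel in root.findall(".//div[@class='info-panel']"):
        rows.append({
            "region": _text(panel.find(".//span[@class='region']")),
            "price": _text(panel.find(".//div[@class='price']/span")),
        })
    return rows


for row in extract_listings(SAMPLE):
    print(row)
```

Reading each field relative to its own panel keeps one listing's values together, even when a field is absent from some listings.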

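This time the code uses a new module, fake_useragent, which returns a random UA (user-agent) string. It is simple to use, and a quick search on Baidu turns up plenty of tutorials.

This post mainly uses it to get a random UA:

self._ua = UserAgent()

self._headers = {"User-Agent": self._ua.random}  # pick a random UA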
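fake_useragent is a third-party package, so it may be missing or fail to load its data on some machines. One possible way to guard against that, sketched under the assumption that a small hard-coded list is an acceptable fallback (the function pick_user_agent and the strings in FALLBACK_UAS are illustrative, not part of the original spider):

```python
import random

# Illustrative fallback values only; any realistic browser UA strings would do.
FALLBACK_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def pick_user_agent():
    """Return a random UA from fake_useragent, or from the fallback list."""
    try:
        from fake_useragent import UserAgent
        return UserAgent().random
    except Exception:
        # Package not installed, or it could not load its UA data.
        return random.choice(FALLBACK_UAS)


headers = {"User-Agent": pick_user_agent()}
print(headers)
```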

Because the page URLs can be built up front, the crawl is written with coroutines, and the CSV file is written with the pandas module.

from fake_useragent import UserAgent
from lxml import etree
import asyncio
import aiohttp
import pandas as pd


class LianjiaSpider(object):

    def __init__(self):
        self._ua = UserAgent()
        self._headers = {"User-Agent": self._ua.random}  # one random UA for the whole run
        self._data = list()  # one dict per listing

    async def get(self, url):
        """Fetch url and return its HTML text, or None on any failure."""
        timeout = aiohttp.ClientTimeout(total=3)  # give up after 3 seconds
        async with aiohttp.ClientSession() as session:
            try:
                async with session.get(url, headers=self._headers, timeout=timeout) as resp:
                    if resp.status == 200:
                        result = await resp.text()
                        return result
            except Exception as e:
                print(e.args)

    async def parse_html(self):
        for page in range(1, 77):  # crawls pages 1 through 76
            url = "https://sjz.lianjia.com/zufang/pg{}/".format(page)
            print("Crawling {}".format(url))
            html = await self.get(url)   # fetch the page source
            if not html:                 # request failed or status was not 200
                continue
            html = etree.HTML(html)  # parse the page
            self.parse_page(html)   # extract the fields we want
            print("Saving data....")
            ######################### write data
            # The whole CSV is rewritten after every page.
            data = pd.DataFrame(self._data)
            data.to_csv("链家网租房数据.csv", encoding='utf_8_sig')   # write to file
            ######################### write data

    def run(self):
        asyncio.run(self.parse_html())


if __name__ == '__main__':
    l = LianjiaSpider()
    l.run()
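The spider's parse_html awaits one page at a time inside a single task, so the pages are still fetched sequentially. If overlapping requests are wanted, one common pattern is asyncio.gather with a semaphore to cap concurrency. The sketch below shows only that pattern: fake_fetch is a stand-in that sleeps instead of making a real HTTP request, and the names fetch_all and max_concurrency are assumptions for illustration.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP request (hypothetical); it only waits briefly."""
    await asyncio.sleep(0.01)
    return "<html>{}</html>".format(url)


async def fetch_all(urls, max_concurrency=5):
    """Fetch all urls concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:  # wait here while too many fetches are in flight
            return await fake_fetch(url)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = ["https://sjz.lianjia.com/zufang/pg{}/".format(p) for p in range(1, 6)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))
```

In the real spider the body of fake_fetch would be the aiohttp request from get. A small concurrency cap also keeps the load on the target site modest.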

The LianjiaSpider class above is still missing its page-parsing method, parse_page, so let's fill it in.

    # Method of LianjiaSpider: extract one dict per listing from a parsed page.
    def parse_page(self, html):
        info_panel = html.xpath("//div[@class='info-panel']")
        for info in info_panel:
            region = self.remove_space(info.xpath(".//span[@class='region']/text()"))
            zone = self.remove_space(info.xpath(".//span[@class='zone']/span/text()"))
            meters = self.remove_space(info.xpath(".//span[@class='meters']/text()"))
            where = self.remove_space(info.xpath(".//div[@class='where']/span[4]/text()"))

            con = info.xpath(".//div[@class='con']/text()")
            floor = con[0]  # floor
            house_type = con[1]   # style (renamed to avoid shadowing the built-in type)
            agent = info.xpath(".//div[@class='con']/a/text()")[0]
            has = info.xpath(".//div[@class='left agency']//text()")  # kept as a list of text nodes

            price = info.xpath(".//div[@class='price']/span/text()")[0]
            price_pre = info.xpath(".//div[@class='price-pre']/text()")[0]
            look_num = info.xpath(".//div[@class='square']//span[@class='num']/text()")[0]

            one_data = {
                "region": region,
                "zone": zone,
                "meters": meters,
                "where": where,
                "louceng": floor,      # floor
                "type": house_type,
                "xiaoshou": agent,     # agent
                "has": has,
                "price": price,
                "price_pre": price_pre,
                "num": look_num
            }
            self._data.append(one_data)  # collect this listing
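parse_page calls self.remove_space, which this post never shows. A minimal sketch of what such a helper might look like, assuming it receives the list of text nodes returned by xpath and should hand back one cleaned string (the author's actual helper may behave differently):

```python
def remove_space(texts):
    """Join XPath text nodes into one string with surrounding whitespace removed.

    Hypothetical helper: returns an empty string when nothing matched.
    """
    if not texts:
        return ""
    return "".join(t.strip() for t in texts)


print(remove_space(["  Qiaoxi \n", "  "]))
print(repr(remove_space([])))
```

To use it from the spider as written, it would be added to LianjiaSpider as a method with self as its first parameter.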

Before long, most of the data has been scraped.
