Scraping the Wuhan Environmental Protection Bureau's daily air quality report with Scrapy and Selenium

1. Introduction

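Target site: Wuhan Environmental Protection Bureau (http://hbj.wuhan.gov.cn/viewAirDarlyForestWaterInfo.jspx). This post hooks Selenium into Scrapy to scrape the daily air quality report, so a working Selenium environment is required; for a rough setup guide, see the post "Basic usage of Selenium". Setup mainly means installing the selenium module and downloading and installing a browser driver. Headless Chrome is often said to be unsupported on Windows, but it has been available there since Chrome 60. A browser with a visible window is noticeably slower, so if you can, run headless (for example, headless Chrome on Linux), which is much faster. Other mainstream browsers such as Firefox have their own drivers, and the setup is much the same. This post uses the regular Chrome driver on Windows (slow, but it works). PhantomJS, the older headless browser, no longer seems to be a practical choice: its development has been suspended and Selenium has deprecated support for it.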

Note: the Chrome driver (ChromeDriver) you download must match the version of Chrome installed on your machine; otherwise the driver may not work.
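If headless mode is available in your environment, enabling it might look roughly like the sketch below. The helper names (`headless_chrome_args`, `make_headless_driver`) and the window size are assumptions made for illustration; the rest of this post keeps using a visible Chrome window.

```python
def headless_chrome_args(width=1920, height=1080):
    """Command-line switches commonly passed to Chrome for a headless run."""
    return [
        '--headless',      # do not open a visible browser window
        '--disable-gpu',   # often recommended alongside --headless on Windows
        '--window-size={},{}'.format(width, height),
    ]


def make_headless_driver():
    """Start Chrome without a window (needs selenium and a matching ChromeDriver)."""
    # Imported here so that headless_chrome_args() can be used without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `make_headless_driver()` in place of `webdriver.Chrome()` should be the only change needed, provided the installed ChromeDriver matches the browser version.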

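2. Task analysis

The goal is to scrape the daily air quality report from the Wuhan Environmental Protection Bureau. The site loads its data asynchronously, and capturing the traffic shows that every page of results comes from the same URL (note: this matters later when writing the code). Because Selenium opens the page in a real browser, there is no need to work out whether the underlying request is a POST or a GET.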

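3. Code logic

3.1 Creating the Scrapy project

Creating the project and the spider, and the resulting file layout, are not covered step by step here; plenty of posts online explain the basics, so let's go straight to the main content.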

3.2 Defining the fields to scrape

Open items.py and define the fields to be scraped.

# -*- coding: utf-8 -*-
import scrapy


class EnvprotectItem(scrapy.Item):
    # Date
    date = scrapy.Field()
    # Monitoring site
    loca = scrapy.Field()
    # SO2
    SO_2 = scrapy.Field()
    # NO2
    NO_2 = scrapy.Field()
    # Inhalable particulate matter (PM10)
    PMIO = scrapy.Field()
    # CO
    CO_1 = scrapy.Field()
    # O3
    O3_d = scrapy.Field()
    # Fine particulate matter (PM2.5)
    PM25 = scrapy.Field()
    # Air quality index (AQI)
    AQIe = scrapy.Field()
    # Primary pollutant
    prmy = scrapy.Field()
    # AQI level
    AQIl = scrapy.Field()
    # AQI category
    AQIt = scrapy.Field()

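3.3 Writing the spider logic

Open the spider file under the spiders folder and write the spider logic.

Parse the total number of records from the result of the first Selenium request, and use it to work out how many pages there are;

Parse the data from the returned page source;

Simulate moving to the next page, parse the new data, and repeat until every page has been parsed;

All Selenium interaction code lives in the downloader middleware in middlewares.py;

Scrapy filters out duplicate requests (requests for the same URL) by default. Since every request here targets the same URL, the duplicate filter has to be bypassed for these requests.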
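The page-count step can be checked in isolation. The sketch below mirrors the spider's arithmetic (22 rows per page, total taken from the leading number of the summary text); the function name `page_count` and the sample summary strings are assumptions, since the exact wording on the site is not reproduced here.

```python
import math


def page_count(summary_text, rows_per_page=22):
    """Return how many result pages are needed for the total in summary_text.

    Assumes the text starts with the record count followed by a space,
    which is what the spider's split(' ')[0] relies on.
    """
    total = int(summary_text.strip().split(' ')[0])
    return math.ceil(total / rows_per_page)


# 81 records at 22 per page need 4 pages (3 full pages plus a partial one).
print(page_count('81 records in total'))
```

Rounding up with `math.ceil` is what makes the last, partially filled page count as a page of its own.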

# -*- coding: utf-8 -*-
import math

import scrapy

from EnvProtect.items import EnvprotectItem


class ProtectenvSpider(scrapy.Spider):
    name = 'ProtectEnv'
    # allowed_domains = ['hbj.wuhan.gov.cn']
    # start_urls = ['http://hbj.wuhan.gov.cn/']
    page = 1   # page currently being parsed
    pages = 1  # total number of pages, set after the first response

    # Target URL
    base_url = 'http://hbj.wuhan.gov.cn/viewAirDarlyForestWaterInfo.jspx'

    def start_requests(self):
        yield scrapy.Request(
            url=self.base_url,
            callback=self.parse,
            dont_filter=True,   # bypass Scrapy's default duplicate-request filter
            meta={'index': 1}   # marks this as the first request
        )

    def parse(self, response):
        """
        On the first response, parse how many records exist for the chosen
        date range (the range is set in middlewares.py, described later).
        Every page comes from the same URL (the URL stays the same when
        paging, only the data changes), so the total only needs to be
        determined once.
        :param response:
        :return:
        """
        if response.meta['index']:
            counts = response.xpath("//div[@class='serviceitempage fr']/span[@class='fl']/text()").extract_first()
            counts = int(counts.split(' ')[0])
            self.pages = math.ceil(counts / 22)  # 22 rows per page

        # Parse the data rows (the first row is the table header)
        node_list = response.xpath('//*[@id="tableForm"]/div/div[3]/table/tbody/tr')[1:]
        for node in node_list:
            item = EnvprotectItem()
            item['date'] = node.xpath("./td[1]/text()").extract_first()
            item['loca'] = node.xpath("./td[2]/text()").extract_first()
            item['SO_2'] = node.xpath("./td[3]/text()").extract_first()
            item['NO_2'] = node.xpath("./td[4]/text()").extract_first()
            item['PMIO'] = node.xpath("./td[5]/text()").extract_first()
            item['CO_1'] = node.xpath("./td[6]/text()").extract_first()
            item['O3_d'] = node.xpath("./td[7]/text()").extract_first()
            item['PM25'] = node.xpath("./td[8]/text()").extract_first()
            item['AQIe'] = node.xpath("./td[9]/text()").extract_first()
            item['prmy'] = node.xpath("./td[10]/text()").extract_first()
            item['AQIl'] = node.xpath("./td[11]/text()").extract_first()
            item['AQIt'] = node.xpath("./td[12]/text()").extract_first()
            yield item

        # Stop condition: keep requesting until the last page has been parsed
        if self.page < self.pages:
            self.page += 1
            yield scrapy.Request(
                url=self.base_url,
                callback=self.parse,
                dont_filter=True,  # bypass Scrapy's default duplicate-request filter
                meta={'index': 0}
            )
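The twelve near-identical assignments in parse() could also be written as a loop over the field names. The sketch below shows the idea with the standard library's ElementTree instead of Scrapy selectors so that it runs on its own; the `FIELDS` list follows items.py, while `rows_to_items` and the sample table (including every value in it) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Field names in column order, as declared in items.py.
FIELDS = ['date', 'loca', 'SO_2', 'NO_2', 'PMIO', 'CO_1',
          'O3_d', 'PM25', 'AQIe', 'prmy', 'AQIl', 'AQIt']

# Invented sample data: one header row followed by one data row.
SAMPLE_TABLE = """
<table><tbody>
  <tr><td>header row, skipped</td></tr>
  <tr><td>2018-11-01</td><td>Site A</td><td>8</td><td>40</td><td>75</td><td>1.0</td>
      <td>60</td><td>45</td><td>63</td><td>PM2.5</td><td>2</td><td>Good</td></tr>
</tbody></table>
"""


def rows_to_items(table_xml):
    """Turn every data row of the table into a dict keyed by FIELDS."""
    root = ET.fromstring(table_xml.strip())
    items = []
    for row in root.findall('./tbody/tr')[1:]:  # [1:] skips the header row
        cells = [(td.text or '').strip() for td in row.findall('td')]
        items.append(dict(zip(FIELDS, cells)))
    return items


for item in rows_to_items(SAMPLE_TABLE):
    print(item['date'], item['loca'], item['AQIe'])
```

In the spider itself the same loop would pair each field name with `./td[n]/text()`, which keeps the column order in one place.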

3.4 Writing the downloader middleware

All Selenium interaction code goes into the downloader middleware.

# -*- coding: utf-8 -*-
import time

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


class EnvprotectDownloaderMiddleware(object):

    def __init__(self):
        """
        Page 1 is shown right after the query, so no jump is needed;
        every later page is reached by simulating a page jump.
        """
        self.index = 1  # number of the page to load next

    def process_request(self, request, spider):
        if request.url != 'http://hbj.wuhan.gov.cn/viewAirDarlyForestWaterInfo.jspx':
            return None  # let Scrapy download any other URL as usual

        driver = webdriver.Chrome()  # launch a Chrome instance
        try:
            driver.get(request.url)  # open the page
            wait = WebDriverWait(driver, 30)  # wait up to 30 s for the page to load
            # The query form sits inside an iframe: wait for it and switch into it
            wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "iframepage")))
            # Select the district
            driver.find_element(By.XPATH, "//select[@id='typedictionary']/option[2]").click()
            # Set the date range
            driver.find_element(By.ID, 'cdateBeginDic').send_keys('2018-11-01')
            driver.find_element(By.ID, 'cdateEndDic').send_keys('2019-01-20')
            # Click "query"
            driver.find_element(By.XPATH, "//a[@href='#' and @onclick='toQuery(2);']").click()
            time.sleep(5)
            # Jump to the wanted page; the original check (index != 1) never
            # fired because index started at 1 and was only incremented inside it
            if self.index > 1:
                driver.find_element(By.ID, 'goPag').send_keys(str(self.index))
                driver.find_element(By.ID, '_goPag').click()
                time.sleep(5)  # give the new page time to load
            html = driver.page_source  # read the source before the browser is closed
        except Exception as e:
            spider.logger.error("Selenium failed on page %s: %s", self.index, e)
            raise
        finally:
            driver.quit()

        self.index += 1  # the spider requests pages strictly one after another
        # Build the response handed back to the spider
        return HtmlResponse(url=request.url, body=html, request=request, encoding='utf-8')
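The fixed time.sleep(5) pauses always cost the full five seconds. Selenium's WebDriverWait, already used above for the iframe, instead polls a condition until it holds or a timeout expires. The sketch below illustrates that polling idea in plain Python; it is not Selenium's implementation, and `wait_until` and `table_loaded` are hypothetical names.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(interval)


# Stand-in for "the result table has appeared": succeeds on the third check.
checks = {'count': 0}


def table_loaded():
    checks['count'] += 1
    return checks['count'] >= 3


wait_until(table_loaded, timeout=2, interval=0.01)
print('checked', checks['count'], 'times')
```

In the middleware, the equivalent would be a `wait.until(...)` call on an element that only exists once the results have loaded; which element that is would have to be confirmed against the live page.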

3.5 Saving the data

Write the data-saving logic in the pipelines file; here the data is saved as an Excel file.

# -*- coding: utf-8 -*-
from openpyxl import Workbook


class EnvprotectPipeline(object):

    def __init__(self):
        # Create the workbook and write the header row
        self.workbook = Workbook()
        self.booksheet = self.workbook.active
        self.booksheet.append(['Date', 'Monitoring site', 'SO2',
                               'NO2', 'PM10', 'CO',
                               'O3', 'PM2.5', 'AQI',
                               'Primary pollutant', 'AQI level', 'AQI category'])

    def process_item(self, item, spider):
        row = [
            item['date'], item['loca'], item['SO_2'],
            item['NO_2'], item['PMIO'], item['CO_1'],
            item['O3_d'], item['PM25'], item['AQIe'],
            item['prmy'], item['AQIl'], item['AQIt']
        ]
        self.booksheet.append(row)
        return item

    def close_spider(self, spider):
        # openpyxl writes the .xlsx format, so use that extension;
        # saving once at the end avoids rewriting the file for every item
        self.workbook.save('./results.xlsx')

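3.6 Other

1. Enable the corresponding item pipeline in settings.py (the downloader middleware from 3.4 has to be enabled there as well);

2. Disable robots.txt compliance (ROBOTSTXT_OBEY = False).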
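The two settings above might look like the following excerpt. The module paths assume the classes shown earlier live in middlewares.py and pipelines.py of the EnvProtect package, and the priorities 543 and 300 are simply the values Scrapy's project template suggests.

```python
# settings.py (excerpt)

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Route requests through the Selenium downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    'EnvProtect.middlewares.EnvprotectDownloaderMiddleware': 543,
}

# Send scraped items to the Excel pipeline.
ITEM_PIPELINES = {
    'EnvProtect.pipelines.EnvprotectPipeline': 300,
}
```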

4. Full code

See the GitHub repository.
