Scrapy-02: Item Pipelines, Shell, and Selectors

Scrapy-02

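Item pipelines:

Scrapy provides Item objects for holding scraped data. They are used much like dictionaries, but unlike a plain dict an Item has an extra safeguard: assigning to a field that was never declared raises an error, which catches misspelled or undefined field names.
An item class must inherit from scrapy.Item and declare its fields with Field(). (We are scraping the novel 盗墓笔记 (Daomu Biji), so there are only two fields: the chapter title and its content.)
Define the item by editing items.py: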

 # -*- coding: utf-8 -*-

 # Define here the models for your scraped items
 #
 # See documentation in:
 # https://doc.scrapy.org/en/latest/topics/items.html

 import scrapy


 class BooksItem(scrapy.Item):
     # define the fields for your item here like:
     # name = scrapy.Field()
     title = scrapy.Field()
     content = scrapy.Field()

Parsing the response and using the item:

 # -*- coding: utf-8 -*-

 import scrapy

 from ..items import BooksItem


 class DmbjSpider(scrapy.Spider):
     name = 'dmbj'
     allowed_domains = ['www.cread.com']
     start_urls = ['http://www.cread.com/chapter/811400395/69162457.html/']

     def parse(self, response):
         item = BooksItem()
         # extract_first() returns the first matching text node, or None if nothing matches
         item['title'] = response.xpath('//h1/text()').extract_first()
         item['content'] = response.xpath('//div[@class="chapter_con"]/text()').extract_first()
         # Yielding the item hands it to the item pipelines
         yield item

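Processing the item in pipelines.py:

 # -*- coding: utf-8 -*-

 # Define your item pipelines here
 #
 # Don't forget to add your pipeline to the ITEM_PIPELINES setting
 # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

 import os


 class BooksPipeline(object):

     def process_item(self, item, spider):
         # Write each chapter to files/<title>.txt; UTF-8 keeps the Chinese text intact
         with open('files/{}.txt'.format(item['title']), 'w', encoding='utf-8') as f:
             f.write(item['content'])
         return item

     def open_spider(self, spider):
         # Called when the spider starts; make sure the output directory exists
         os.makedirs('files', exist_ok=True)

     def close_spider(self, spider):
         # Called when the spider closes
         pass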

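In the spider, import the item class defined in items.py and instantiate it inside the parse method. The instance is used like a dictionary: assign values to it directly by key, and each key must match a field name declared in the class.

Then yield the item.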
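Three methods are defined in pipelines.py:
- process_item:
  - processes each item yielded by parse, then returns it so it can be passed on
- open_spider:
  - called automatically when the spider starts
- close_spider:
  - called when the spider closes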
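
The call order of these three hooks can be sketched without running a crawl, since a pipeline is just a plain Python class. SingleFilePipeline, its default path, and the sample item below are hypothetical names for illustration; the hooks are driven by hand in the order Scrapy would call them.

```python
import os
import tempfile


class SingleFilePipeline:
    """Hypothetical pipeline: one output file, opened once per crawl."""

    def __init__(self, path='book.txt'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Runs once at startup: acquire resources here.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Runs for every yielded item; return it so later pipelines see it.
        self.file.write('{}\n{}\n\n'.format(item['title'], item['content']))
        return item

    def close_spider(self, spider):
        # Runs once at shutdown: release resources here.
        self.file.close()


# Drive the hooks manually; spider=None is a stand-in for the real spider.
out_path = os.path.join(tempfile.mkdtemp(), 'book.txt')
pipeline = SingleFilePipeline(out_path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item(
    {'title': 'Chapter 1', 'content': 'Some text.'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```

Opening the file in open_spider and closing it in close_spider avoids reopening a file for every item, which is one common reason to use these two hooks at all.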
For a pipeline defined in pipelines.py to take effect, it has to be enabled in the ITEM_PIPELINES dictionary in settings.py:

ITEM_PIPELINES = {
   'books.pipelines.BooksPipeline': 300,
}
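
The number attached to each pipeline sets its order: items pass through the enabled pipelines from the lowest value to the highest, and values are conventionally kept in the 0 to 1000 range. The sketch below only illustrates that ordering with a plain sort; CleanTextPipeline is a hypothetical second pipeline that does not exist in this project.

```python
ITEM_PIPELINES = {
    'books.pipelines.BooksPipeline': 300,
    'books.pipelines.CleanTextPipeline': 100,  # hypothetical, for illustration
}

# Lower numbers run first, so sorting by value gives the processing order.
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

With these values an item would be cleaned first and written to disk second, regardless of the order in which the entries appear in the dictionary.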

shell
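- scrapy shell is an interactive debugging tool that ships with Scrapy. If IPython is installed in the current environment it is used by default; the shell can also be chosen explicitly in scrapy.cfg by adding shell = ipython under the [settings] section
- Using scrapy shell:
  - In a terminal, run: scrapy shell [url] // url: the page you want to scrape; optional (it can also be a local file, given as a path)
- fetch:
  - fetch takes a URL, builds a new request from it, and updates response with the newly returned response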
