Items

Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
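
Since Items expose a dictionary-like API, a short usage sketch (assuming the Product item declared above) looks like this:

product = Product(name='Desk', price=100)
print(product['name'])      # 'Desk'
product['stock'] = 5        # set a declared field, just like a dict
# product['weight'] = 1     # would raise KeyError: only declared fields are allowed
print(dict(product))        # Items convert to plain dicts of the populated fields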

Extending Items

You can extend Items (to add more fields, or to change some metadata for some fields) by declaring a subclass of your original Item.

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
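
The subclass keeps every field declared on Product and only adds the new one; a quick check (assuming the classes above):

print(sorted(DiscountedProduct.fields.keys()))
# ['discount_percent', 'last_updated', 'name', 'price', 'stock']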

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values.

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
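
my_serializer is not defined above; it stands for any callable applied to the field value on export. A minimal hypothetical stand-in could be:

def my_serializer(value):
    # Hypothetical serializer: export the name as a stripped, title-cased string.
    return str(value).strip().title()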

Item Objects

1. class scrapy.item.Item([arg])

Return a new Item, optionally initialized from the given argument.

The only additional attribute provided by Items is fields: a dictionary containing all of this Item's declared fields.
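
As a small sketch of both points (assuming the Product item from earlier): an Item can be initialized from a dict, and the declared fields with their metadata are available through the fields attribute:

product = Product({'name': 'Desk', 'price': 100})    # Item([arg]): initialize from a dict
print(Product.fields.keys())                         # all declared field names
print(Product.fields['last_updated']['serializer'])  # metadata set at declaration time (str)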

2. Field objects

class scrapy.item.Field([arg])

The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes.

_______________________________________________________________________________________________________________________________

Built-in spiders reference

Scrapy comes with some useful generic spiders that you can subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases.

class scrapy.spider.Spider

This is the simplest spider, and the one from which every other spider must inherit.

Important attributes and methods:

name

A string which defines the name for this spider. It must be unique. This is the most important spider attribute, and it is required.

allowed_domains

An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

start_requests()

This is the method called by Scrapy when the spider is opened for scraping and no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests.

make_requests_from_url(url)

A method that receives a URL and returns a Request object to scrape. Unless overridden, this method returns Requests with the parse() method as their callback function.

parse(response)

The parse method is in charge of processing the response and returning scraped data.

log(message[, level, component])

Log a message.

closed(reason)

Called when the spider closes.
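
Putting these attributes and methods together, a minimal sketch of a plain Spider (the URL, XPath expression, and item are placeholders) might look like this:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field

class PageItem(Item):
    # A tiny hypothetical item used only for this sketch.
    title = Field()
    url = Field()

class ExampleSpider(Spider):
    # name is required and must be unique across the project's spiders.
    name = 'example'
    # With OffsiteMiddleware enabled, off-domain requests won't be followed.
    allowed_domains = ['example.com']
    # Crawling starts here; start_requests() turns each URL into a Request
    # whose callback is parse() (via make_requests_from_url()).
    start_urls = ['http://www.example.com/index.html']

    def parse(self, response):
        # The default callback: process the response and return scraped data.
        self.log('Visited %s' % response.url)
        item = PageItem()
        item['title'] = Selector(response).xpath('//title/text()').extract()
        item['url'] = response.url
        return item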

class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.

In addition to the attributes inherited from Spider, CrawlSpider provides the following attribute:

rules

A list of one or more Rule objects. Each Rule defines a certain behaviour for crawling the site.

About Rule objects:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string to be called for each link extracted with the specified link_extractor.

Note: when writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True (i.e. the extracted links are followed and the crawl continues through them); otherwise it defaults to False.

process_request is a callable or a string which will be called with every request extracted by this rule, and must return a request or None.
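
As a minimal sketch of how these pieces fit together (the URL patterns are placeholders, and SgmlLinkExtractor is the link extractor described in the next part of this section):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = 'crawl_example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # No callback: follow defaults to True, so category pages are only
        # used to discover more links.
        Rule(SgmlLinkExtractor(allow=(r'category\.php',))),
        # With a callback, follow defaults to False: item pages are handled
        # by parse_item and their links are not followed further.
        Rule(SgmlLinkExtractor(allow=(r'item\.php',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # Note the callback is NOT named parse(), which CrawlSpider reserves
        # for its own link-following logic.
        self.log('Item page: %s' % response.url)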

------------------------------------------------------------------------------------------------------------------------------------

LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects).

Scrapy comes with two built-in Link Extractors, but you can also write your own as needed.

All available link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.

SgmlLinkExtractor

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow,...)

The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links.

allow (a regular expression, or a list of regular expressions): a single regular expression (or a list of regular expressions) that the URLs must match in order to be extracted. If not given (or empty), it will match all links.
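
A short sketch of using it on its own (the allow pattern is a placeholder); extract_links() returns the Link objects found in a response:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Only links whose URL matches the pattern are extracted; with allow
# omitted, every link on the page would match.
link_extractor = SgmlLinkExtractor(allow=(r'/products/\d+',))

def links_from(response):
    # extract_links() returns a list of scrapy.link.Link objects.
    return [link.url for link in link_extractor.extract_links(response)]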
