Implementing Scenario 2:

Data fingerprint

  • Use the detail page's URL as the data fingerprint; the dedup mechanics are sketched right below.
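
The idea in miniature: Redis's SADD returns 1 only when the member is new, so a single call both records the fingerprint and reports whether the URL has been seen before. A minimal sketch, assuming a local Redis on the default port (the key name seen_urls is made up for illustration):

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)

def is_new(detail_url):
    # SADD returns 1 if the URL was just added (never seen), 0 if already present
    return conn.sadd('seen_urls', detail_url) == 1

print(is_new('https://sc.chinaz.com/jianli/example.html'))  # True on the first run
print(is_new('https://sc.chinaz.com/jianli/example.html'))  # False afterwards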

Create the spider file:

  • cd project_name (enter the project directory)
  • scrapy genspider spider_file_name (any name you like) start_url
    • (e.g. scrapy genspider first www.xxx.com)
  • On success, a .py spider file is generated under the spiders folder, with a skeleton like the sketch below
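
For reference, the skeleton that scrapy genspider first www.xxx.com produces looks roughly like this (the exact boilerplate varies slightly across Scrapy versions):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["www.xxx.com"]
    start_urls = ["https://www.xxx.com"]

    def parse(self, response):
        pass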

Go to the spider file:

  • cd spider_file_name (i.e., the custom name you chose)

Spider file

import scrapy
import redis
import random
from ..items import Zlsdemo2ProItem

class JianliSpider(scrapy.Spider):
    name = "jianli"
    # allowed_domains = ["www.xxx.com"]
    start_urls = ["https://sc.chinaz.com/jianli/free.html"]
    # Connect to the Redis database
    conn = redis.Redis(
        host='127.0.0.1',
        port=6379,
    )

    def parse(self, response):
        div_list = response.xpath('//*[@id="container"]/div')
        for div in div_list:
            title = div.xpath('./p/a/text()').extract_first()
            detail_url = div.xpath('./p/a/@href').extract_first()
            # print(title, detail_url)
            # SADD records the fingerprint and returns 1 only if it is new
            ex = self.conn.sadd('data_id', detail_url)
            # Build the item object
            item = Zlsdemo2ProItem()
            item['title'] = title  # pass the title to the item
            if ex == 1:  # the fingerprint was not in the database: new data
                print('New data found, collecting...')
                # Pass the item (with its title) to parse_detail via meta
                yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})
            else:  # the fingerprint is already in the database
                print('No new data!')

    def parse_detail(self, response):
        item = response.meta['item']
        download_li = response.xpath('//*[@id="down"]/div[2]/ul/li')
        download_urls = []
        for li in download_li:
            download_url_e = li.xpath('./a/@href').extract_first()
            download_urls.append(download_url_e)
        # Pick one of the download mirrors at random
        download_url = random.choice(download_urls)
        item['download_url'] = download_url
        yield item
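
With the spider saved, start a run from the project directory (the command name follows from name = "jianli" above):

  scrapy crawl jianli

On the first run every detail URL is new, so all items are collected; on later runs only URLs whose SADD returns 1 trigger detail requests, which is exactly the incremental behavior Scenario 2 calls for.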

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class Zlsdemo2ProItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    download_url = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class Zlsdemo2ProPipeline:
    def process_item(self, item, spider):
        # Reuse the Redis connection created on the spider class
        conn = spider.conn
        # lpush cannot serialize an Item object directly, so store it as a plain dict string
        conn.lpush('data1', str(ItemAdapter(item).asdict()))
        return item
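
To check what the pipeline stored, the list can be read back from Redis (a quick sketch; data1 is the key used above):

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
# lrange returns the raw entries pushed by the pipeline, newest first
for raw in conn.lrange('data1', 0, -1):
    print(raw.decode('utf-8'))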

settings.py

# Scrapy settings for zlsDemo2Pro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "zlsDemo2Pro"

SPIDER_MODULES = ["zlsDemo2Pro.spiders"]
NEWSPIDER_MODULE = "zlsDemo2Pro.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
#    "Accept-Language": "en",
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "zlsDemo2Pro.middlewares.Zlsdemo2ProSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "zlsDemo2Pro.middlewares.Zlsdemo2ProDownloaderMiddleware": 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    "scrapy.extensions.telnet.TelnetConsole": None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "zlsDemo2Pro.pipelines.Zlsdemo2ProPipeline": 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
