1. Modify the Redis configuration

Redis runs in protected mode by default. Edit /etc/redis.conf and set protected-mode no, or set a password for Redis instead.

Comment out the bind 127.0.0.1 line with a #.
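Before going further, it is worth confirming that a worker machine can actually reach the master's Redis. A minimal check with the redis-py client, assuming the example master IP used later in this article:

import redis

# 192.168.1.20 is the example master IP from this article; replace with your own
r = redis.Redis(host='192.168.1.20', port=6379)
print(r.ping())  # prints True when the master accepts the connection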

2. Modify the crawler settings

Add the following settings to settings.py.

REDIS_URL is the master's IP plus the Redis port.

# For scrapy_redis

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Don't cleanup redis queues, allows to pause/resume crawls.
SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'
REDIS_URL = 'redis://192.168.1.20:6379'  # change to your own IP and port
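If you secured the master with requirepass instead of disabling protected mode, the password goes into the URL; the password below is a placeholder:

REDIS_URL = 'redis://:yourpassword@192.168.1.20:6379'  # 'yourpassword' is a placeholder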

3. Modify the spider code

  • Make the spider inherit from RedisSpider
from scrapy_redis.spiders import RedisSpider

class DoubanSpider(RedisSpider):
  • Add a redis_key attribute; this is the key under which start_urls is stored in Redis
  • Comment out start_urls; the full spider then looks like this:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from project_douban.items import Movie
from scrapy_redis.spiders import RedisSpider


class DoubanSpider(RedisSpider):
    name = 'douban'
    allowed_domains = ['douban.com']
    redis_key = "doubanSpider:start_urls"
    #start_urls = ['https://movie.douban.com/top250']

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
    }

    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': headers,
        'REDIRECT_ENABLED': 'False',
        #'LOG_LEVEL': 'WARNING',
    }

    def parse(self, response):
        items = response.xpath('//div[@class="item"]')
        for item in items:
            movie = Movie()
            movie['index'] = item.xpath('div//em/text()').extract_first(default='')
            self.logger.info(movie['index'])
            movie['src'] = item.xpath('div//img/@src').extract_first(default='')
            self.logger.info(movie['src'])
            movie['title'] = item.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first(default='')
            self.logger.info(movie['title'])
            movie['star'] = item.xpath('.//span[@class="rating_num"]/text()').extract_first(default='')
            self.logger.info(movie['star'])
            movie['info'] = item.xpath('.//div[@class="bd"]/p').xpath('string(.)').extract_first(default='').strip().replace(' ', '').replace('\xa0', ' ').replace('\n', ' ')
            self.logger.info(movie['info'])
            yield movie

        next_url = response.xpath('//span[@class="next"]/a/@href').extract_first(default='')
        self.logger.info('next_url: ' + next_url)
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url
            yield Request(next_url, headers=self.headers)
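A RedisSpider starts out idle and waits until a start URL is pushed to its redis_key. A minimal sketch that seeds the queue via redis-py, using the key and master IP from above (running redis-cli lpush from the shell works just as well):

import redis

r = redis.Redis(host='192.168.1.20', port=6379)  # the master from settings.py
# push the first URL under the spider's redis_key; idle workers pick up work from here
r.lpush('doubanSpider:start_urls', 'https://movie.douban.com/top250')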
  • Write logs to a file (optional)
import logging
import os
import time


def get_logger(name, start_time=time.strftime('%Y_%m_%d_%H', time.localtime())):
    path = '/var/log/scrapy-redis/'
    # path = 'baidu_tieba.log'
    if not os.path.exists(path):
        os.makedirs(path)
    log_path = path + start_time

    # Create a logger
    my_logger = logging.getLogger(name)
    my_logger.setLevel(logging.INFO)

    formatter = logging.Formatter('[%(asctime)s] [%(levelname)s] %(filename)s[line:%(lineno)d] %(message)s', datefmt='%Y-%m-%d %H:%M:%S')

    # Create handlers that write each level to its own log file
    handler_info = logging.FileHandler('%s_info.log' % log_path, 'a', encoding='UTF-8')
    handler_info.setLevel(logging.INFO)
    handler_info.setFormatter(formatter)
    my_logger.addHandler(handler_info)

    handler_warning = logging.FileHandler('%s_warning.log' % log_path, 'a', encoding='UTF-8')
    handler_warning.setLevel(logging.WARNING)
    handler_warning.setFormatter(formatter)
    my_logger.addHandler(handler_warning)

    handler_error = logging.FileHandler('%s_error.log' % log_path, 'a', encoding='UTF-8')
    handler_error.setLevel(logging.ERROR)
    handler_error.setFormatter(formatter)
    my_logger.addHandler(handler_error)

    return my_logger
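A hypothetical usage, assuming get_logger lives in a module the spider can import; each handler's level decides which files a record reaches:

logger = get_logger('douban')
logger.info('spider started')    # written to the ..._info.log file only
logger.warning('slow response')  # passes both the INFO and WARNING handlers, so lands in both files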

Miscellaneous

RedisSpider vs RedisCrawlSpider

Reading the source code of the two directly, the differences come down to:

item                      RedisSpider      RedisCrawlSpider
REDIS_START_URLS_AS_SET   default: False   default: True
base class                Spider           CrawlSpider

scrapy.Spider -> scrapy.CrawlSpider

scrapy.Spider is the base class of all spiders. scrapy.CrawlSpider builds on scrapy.Spider and adds rules: you define crawl rules and the spider only follows pages that match them. RedisCrawlSpider inherits this feature as well, as sketched below.
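A minimal sketch of a RedisCrawlSpider under the same project layout; the spider name, redis_key, and the pagination Rule below are illustrative assumptions, not from the original article:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class DoubanCrawlSpider(RedisCrawlSpider):
    name = 'douban_crawl'                        # hypothetical name for this sketch
    allowed_domains = ['douban.com']
    # note: REDIS_START_URLS_AS_SET defaults to True here, so seed with SADD, not LPUSH
    redis_key = 'doubanCrawlSpider:start_urls'

    # only follow Top 250 pagination links such as ?start=25&filter=
    rules = (
        Rule(LinkExtractor(allow=r'\?start=\d+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        self.logger.info('crawled: %s', response.url)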

