Install Scrapy

pip install scrapy

Create a new project

(python36) E:\www>scrapy startproject fileDownload
New Scrapy project 'fileDownload', using template directory 'c:\users\brady\.conda\envs\python36\lib\site-packages\scrapy\templates\project', created in:
    E:\www\fileDownload

You can start your first spider with:
    cd fileDownload
    scrapy genspider example example.com

(python36) E:\www>

Edit the spider to extract content

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from fileDownload.items import FiledownloadItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['https://www.pexels.com/photo/white-concrete-building-2559175/']

    rules = (
        Rule(LinkExtractor(allow=r'/photo/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        # Collect the src of every <img> whose URL contains "photos"
        urls = response.xpath("//img[contains(@src,'photos')]/@src").extract()
        item = FiledownloadItem()
        try:
            item['file_urls'] = urls
            # urls is a list, so format it instead of concatenating it to a str
            print("Scraped image list: %s" % urls)
            yield item
        except Exception as e:
            print(str(e))
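To see which links the rule above would follow, the pattern can be tried outside Scrapy. The sketch below assumes that LinkExtractor applies each allow pattern as a regular-expression search against the absolute URL. The sample URLs other than the spider's start URL are made up for illustration, and would_follow is a hypothetical helper, not part of Scrapy.

```python
import re

# Same pattern as the Rule in the spider
ALLOW = re.compile(r'/photo/')

# Hypothetical sample links; only the first comes from the spider's start_urls
candidates = [
    'https://www.pexels.com/photo/white-concrete-building-2559175/',
    'https://www.pexels.com/search/building/',
    'https://www.pexels.com/photo/another-example-1/',
]


def would_follow(url):
    """Return True when the allow pattern matches somewhere in the URL."""
    return ALLOW.search(url) is not None


followed = [u for u in candidates if would_follow(u)]
print(followed)
```

Because the pattern is unanchored, any URL containing /photo/ qualifies, which is why follow=True lets the crawl spread from one photo page to the others it links to.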

Configure the item

import scrapy


class FiledownloadItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    file_urls = scrapy.Field()

  

settings.py

Enable the files pipeline

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 2}  # files pipeline

FILES_STORE = ''  # storage path

In the item

file_urls = scrapy.Field()

files = scrapy.Field()

In the spider, pass the URLs to the pipeline through the file_urls field
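The two fields play different roles: the spider fills file_urls, and the files pipeline writes its download results into files. The sketch below imitates that hand-off with plain dictionaries. fake_files_pipeline is a hypothetical stand-in, not Scrapy code: it downloads nothing, the URL is a placeholder, and the real pipeline records more per file (a checksum, for example) and by default picks a different path.

```python
def fake_files_pipeline(item):
    """Hypothetical stand-in for FilesPipeline: record one result per URL."""
    for url in item['file_urls']:
        # Assumed layout for illustration only: 'full/' plus the last URL segment
        item['files'].append({'url': url, 'path': 'full/' + url.split('/')[-1]})
    return item


# What the spider would yield: URLs filled in, results still empty
item = {'file_urls': ['http://example.com/a.jpg'], 'files': []}

result = fake_files_pipeline(item)
print(result['files'])
```

Declaring both fields on the item matters because a scrapy.Item rejects keys that were not declared as fields.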

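Override the files pipeline so each file is saved under the image's original name

In pipelines.py, create your own files pipeline that inherits from FilesPipeline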

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.files import FilesPipeline


class FiledownloadPipeline(object):
    def process_item(self, item, spider):
        # Strip the query string from every URL before it reaches the files pipeline
        tmp = item['file_urls']
        item['file_urls'] = []
        for i in tmp:
            if "?" in i:
                item['file_urls'].append(i.split('?')[0])
            else:
                item['file_urls'].append(i)
        print(item)
        return item


class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # Use the last segment of the URL as the file name
        file_path = request.url
        file_path = file_path.split('/')[-1]
        print("Downloading image " + file_path)
        return 'full/%s' % (file_path)
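The two string operations used by the pipelines above can be checked in isolation. This is a minimal sketch using only the standard library; strip_query and original_name are hypothetical helper names, and the URL is a made-up example in the shape of a Pexels image link.

```python
import posixpath
from urllib.parse import urlsplit


def strip_query(url):
    """Drop everything from the first '?' onward, as FiledownloadPipeline does."""
    return url.split('?')[0]


def original_name(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Hypothetical URL for illustration
url = 'https://images.pexels.com/photos/2559175/pexels-photo-2559175.jpeg?auto=compress&w=500'

print(strip_query(url))
print('full/%s' % original_name(url))
```

One trade-off of naming files this way: two different URLs that end in the same file name map to the same path under FILES_STORE, whereas the default pipeline derives the name from a hash of the URL.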

In settings.py, enable your own files pipeline instead

ITEM_PIPELINES = {
    'fileDownload.pipelines.FiledownloadPipeline': 1,
    'fileDownload.pipelines.MyFilesPipeline': 2,
    #'scrapy.pipelines.files.FilesPipeline':2
}
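The numbers in ITEM_PIPELINES decide the order: Scrapy runs pipelines from the lowest value to the highest, so the URL clean-up in FiledownloadPipeline happens before MyFilesPipeline downloads anything. The sketch below only sorts the dictionary to make that order visible; it does not involve Scrapy itself, and run_order is a name introduced here for illustration.

```python
# Same entries as in settings.py, deliberately listed out of order
ITEM_PIPELINES = {
    'fileDownload.pipelines.MyFilesPipeline': 2,
    'fileDownload.pipelines.FiledownloadPipeline': 1,
}

# Lower values run first, so sort by the configured number
run_order = [path for path, order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

for path in run_order:
    print(path)
```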

Fetch a multi-page image set

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AngelSpider(CrawlSpider):
    name = 'angel'
    allowed_domains = ['angelimg.spbeen.com']
    start_urls = ['http://angelimg.spbeen.com/']

    base_url = "http://angelimg.spbeen.com"
    rules = (
        Rule(LinkExtractor(allow=r'^http://angelimg\.spbeen\.com/ang/\d+$'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Reuse the item carried over from the previous page, or start a new one
        item = response.meta.get('item', False)
        if not item:
            item = {}
            item['files'] = []
            item['file_urls'] = []
        print(response.url)
        img_url = response.xpath('.//div[@id="content"]/a/img/@src').extract_first()
        # Skip pages without an image so None never reaches the files pipeline
        if img_url:
            item['file_urls'].append(img_url)
        # If there is a next page, request it; otherwise hand the item to the pipelines
        next_url = response.xpath('.//div[@class="page"]//a[contains(@class,"next")]/@href').extract_first()
        if next_url:
            next_url = self.base_url + next_url
            yield scrapy.Request(next_url, callback=self.parse_item, meta={'item': item})
        else:
            print(item)
            yield item

    def parse_next_response(self, response):
        item = response.meta.get('item')
        print(item, response.url)
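The spider above builds one item per image set by carrying it from page to page and yielding it only on the last page. The sketch below reproduces that control flow without Scrapy, assuming a hypothetical three-page set stored in a dictionary. PAGES, BASE_URL, and parse_page are illustrative names, and the recursive call stands in for yielding a scrapy.Request with the item in meta.

```python
BASE_URL = 'http://example.com'  # placeholder host, not the real site

# Hypothetical image set: each page has one image and an optional next-page link
PAGES = {
    '/ang/1': {'img': '/img/1-1.jpg', 'next': '/ang/1/2'},
    '/ang/1/2': {'img': '/img/1-2.jpg', 'next': '/ang/1/3'},
    '/ang/1/3': {'img': '/img/1-3.jpg', 'next': None},
}


def parse_page(path, item=None):
    """Mimic parse_item: add this page's image, then continue or finish."""
    if not item:
        item = {'files': [], 'file_urls': []}
    page = PAGES[path]
    item['file_urls'].append(BASE_URL + page['img'])
    if page['next']:
        # Stand-in for yielding scrapy.Request(next_url, meta={'item': item})
        return parse_page(page['next'], item)
    # Last page: the finished item would now go to the pipelines
    return item


gallery = parse_page('/ang/1')
print(gallery['file_urls'])
```

Yielding only at the end means the pipelines receive one item holding every URL of the set, rather than one item per page.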

  

GitHub repository

https://github.com/brady-wang/spider-fileDownload

  
