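Notes:

I previously scraped pictures of pretty girls with the requests module; today I implemented the same thing with the Scrapy framework.

(The pictures are admittedly a bit risqué, but this old monk has long since left worldly desires behind, so consider it pure art appreciation, hahaha.)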

Item:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class GirlpicItem(scrapy.Item):
    title = scrapy.Field()  # gallery title, used as the folder name
    image = scrapy.Field()  # raw image bytes
    index = scrapy.Field()  # page number within the gallery

Spider:

# coding: utf-8
from scrapy.spiders import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from girlpic.items import GirlpicItem
import scrapy
import sys

# Python 2 only: make UTF-8 the default string encoding.
reload(sys)
sys.setdefaultencoding('utf-8')


class GirlpicSipder(Spider):
    name = 'girlpic'
    allowed_domains = []  # allowed domains (left empty)
    start_urls = ["http://www.mzitu.com/all/"]

    def parse(self, response):
        # Every <a> in the archive list links to one gallery.
        groups = response.xpath("//div[@class='main-content']//ul[@class='archives']//a")
        count = 0
        for group in groups:
            count = count + 1
            if count > 5:
                return  # careful: just return here, do not force the process to exit
            groupUrl = group.xpath('@href').extract()[0]
            title = group.xpath("text()").extract()[0]
            request = scrapy.Request(url=groupUrl, callback=self.getGroup, meta={'title': title, 'groupUrl': groupUrl}, dont_filter=True)
            yield request

    def getGroup(self, response):
        # The second-to-last pagination <span> holds the highest page number.
        maxIndex = response.xpath("//div[@class='pagenavi']//span/text()").extract()[-2]
        for index in range(1, int(maxIndex) + 1):
            pageUrl = response.meta['groupUrl'] + '/' + str(index)
            meta = response.meta
            meta['index'] = index
            request = scrapy.Request(url=pageUrl, callback=self.getPage, meta=meta, dont_filter=True)
            yield request

    def getPage(self, response):
        imageurl = response.xpath("//div[@class='main-image']//img/@src").extract()[0]  # get the image URL
        request = scrapy.Request(url=imageurl, callback=self.FormItem, meta=response.meta, dont_filter=True)
        yield request

    def FormItem(self, response):
        # The response body is the raw image; pack it into an item for the pipeline.
        title = response.meta['title']
        index = response.meta['index']
        image = response.body
        item = GirlpicItem(title=title, index=index, image=image)
        yield item
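
The paging logic in getGroup can be tried on its own, without running a crawl. The sketch below is only an illustration: the helper names `last_page_number` and `page_urls`, the sample span texts, and the gallery URL are all made up, and it assumes (as the spider's `[-2]` index does) that the last pagination entry is a "next page" label. Unlike the spider, it also strips a trailing slash before joining, which is a small addition of mine.

```python
def last_page_number(span_texts):
    """Pick the highest page number from the pagination <span> texts.

    Assumption: the final entry is a "next page" label, so the entry
    before it is the last page number (mirrors extract()[-2]).
    """
    return int(span_texts[-2])


def page_urls(group_url, max_index):
    """Build one URL per page: <group_url>/1 ... <group_url>/<max_index>."""
    base = group_url.rstrip('/')  # avoid a double slash if the link ends in '/'
    return [base + '/' + str(i) for i in range(1, max_index + 1)]


# Hypothetical sample data, not taken from the real site.
spans = ['1', '2', '3', '4', 'next']
urls = page_urls('http://www.mzitu.com/12345', last_page_number(spans))
print(urls)
```

If the site changes its pagination markup, the `[-2]` assumption is the first thing likely to break, so it is worth checking in the Scrapy shell before a full run.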

Pipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import sys

# Python 2 only: make UTF-8 the default string encoding.
reload(sys)
sys.setdefaultencoding('utf-8')


class GirlpicPipeline(object):

    def __init__(self):
        self.dirpath = u'D:\\学习资料'  # root output folder
        if not os.path.exists(self.dirpath):
            os.makedirs(self.dirpath)

    def process_item(self, item, spider):
        title = item['title']
        index = item['index']
        image = item['image']
        # One folder per gallery, one <index>.jpg per page.
        groupdir = os.path.join(self.dirpath, title)
        if not os.path.exists(groupdir):
            os.makedirs(groupdir)
        imagepath = os.path.join(groupdir, str(index) + u'.jpg')
        # Binary mode: the image bytes are written unchanged.
        with open(imagepath, 'wb') as f:
            f.write(image)
        return item
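
As the comment at the top of the pipeline file says, the pipeline only runs once it is registered in the ITEM_PIPELINES setting. A minimal sketch of that settings fragment follows; the module path `girlpic.pipelines` is an assumption based on the `girlpic.items` import in the spider and Scrapy's default project layout, and 300 is just the customary default priority.

```python
# settings.py (fragment)
# Assumption: the pipeline class lives in girlpic/pipelines.py.
ITEM_PIPELINES = {
    # Priorities are conventionally in the 0-1000 range; lower values run first.
    'girlpic.pipelines.GirlpicPipeline': 300,
}
```

With this in place, running `scrapy crawl girlpic` from the project directory should pass every yielded item through process_item.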
