1.安装

1.1自行安装python3环境

1.2ide使用pycharm

1.3安装scrapy框架

pip install twisted
pip install lxml
pip install scrapy

2.入门案例

2.1新建项目工程

scrapy startproject mySpider

2.2配置settings文件

ROBOTSTXT_OBEY = True

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
ua = random.choice(USER_AGENT_LIST)
if ua:
USER_AGENT = ua
print(ua)
else:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" #是否遵守robots规则
ROBOTSTXT_OBEY = False
#线程数量
CONCURRENT_REQUESTS = 32
#下载延迟单位秒
DOWNLOAD_DELAY = 3
#cookies开关,建议禁用
COOKIES_ENABLED = False

2.3新建爬虫app

新建app

在spiders目录下

scrapy genspider acg12 "acg12.com"



要建立一个Spider, 你必须用scrapy.Spider类创建一个子类,并确定了三个强制的属性 和 一个方法。

  • name = "" :这个爬虫的识别名称,必须是唯一的,在不同的爬虫必须定义不同的名字。
  • allow_domains = [] 是搜索的域名范围,也就是爬虫的约束区域,规定爬虫只爬取这个域名下的网页,不存在的URL会被忽略。
  • start_urls = () :爬取的URL元祖/列表。爬虫从这里开始抓取数据,所以,第一次下载的数据将会从这些urls开始。其他子URL将会从这些起始URL中继承性生成。
  • parse(self, response) :解析的方法,每个初始URL完成下载后将被调用,调用的时候传入从每一个URL传回的Response对象来作为唯一参数,主要作用如下:
    1. 负责解析返回的网页数据(response.body),提取结构化数据(生成item)
    2. 生成需要下一页的URL请求。

将start_urls的值修改为需要爬取的第一个url

start_urls = ("https://acg12.com/274004/",) #这个网页可能坏掉换一个新的

修改parse()方法

def parse(self, response):
with open("acg12.html", "w",encoding='utf-8') as f:
f.write(response.text)

然后运行一下看看,在mySpider目录下执行:

scrapy crawl acg12

是的,就是 acg12,看上面代码,它是 Acg12Spider 类的 name 属性,也就是使用 scrapy genspider命令的爬虫名。

最新文章

  1. lightbox使用
  2. Yii2-多表关联的用法示例
  3. WindowManager 实现悬浮窗 详解
  4. 使用HttpFileServer自建下载服务器
  5. ACM 找点
  6. MYSQL INSERT INTO SELECT 不插入重复数据
  7. 初学Node(二)package.json文件
  8. http协议本身能获取客户端Mac地址问题
  9. duilib让不同的容器使用不同的滚动条样式
  10. Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  11. SSH三大框架整合使用的配置文件 注解实现
  12. 浅谈POSIX线程的私有数据
  13. c# 程序结构
  14. Day2------字符编码
  15. LeetCode --> 771. Jewels and Stones
  16. 【Python 19】BMR计算器3.0(字符串分割与格式化输出)
  17. vue_VueRouter 路由_路由器管理n个路由_并向路由组件传递数据_新标签路由_编程式路由导航
  18. msql事务与引擎
  19. 用反向代理nginx proxy_pass配置解决ie8 ajax请求被拦截问题 ie8用nginx代理实现跨域请求访问 nginx405正向代理request_uri
  20. sas和ssd盘写入数据效率对比

热门文章

  1. Exceptionless—本地部署
  2. Apache Tomcat 远程代码执行漏洞(CVE-2019-0232)漏洞复现
  3. Kettle6.1连接MongoDB报错
  4. P3067 [USACO12OPEN]平衡的奶牛群(折半暴搜)
  5. 小程序---电影商城---第三方组件 vant(vant weapp)
  6. python 做一个简单的登录接口
  7. spark集群搭建(三台虚拟机)——kafka集群搭建(4)
  8. pat 1092 To Buy or Not to Buy(20 分)
  9. nyoj 78-圈水池 (凸包)
  10. nyoj 813-对决 (i*j == k)