This article scrapes the following fields: job title, company name, company location, salary, and release time.

Create the spider project

scrapy startproject qianchengwuyou

cd qianchengwuyou

scrapy genspider -t crawl qcwy www.xxx.com
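
The two commands above generate the standard Scrapy project skeleton; `genspider -t crawl` adds the qcwy.py spider from the CrawlSpider template (layout sketched below, as produced by recent Scrapy versions):

```
qianchengwuyou/
├── scrapy.cfg
└── qianchengwuyou/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qcwy.py
```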

Define the fields to scrape in items.py

import scrapy

class QianchengwuyouItem(scrapy.Item):
    # define the fields for your item here like:
    job_title = scrapy.Field()
    company_name = scrapy.Field()
    company_address = scrapy.Field()
    salary = scrapy.Field()
    release_time = scrapy.Field()
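
Conceptually, a scrapy.Item is a dict that only accepts the fields declared on the class. The stdlib sketch below imitates that behavior to show why assigning an undeclared key fails; it is an illustration, not Scrapy's actual implementation:

```python
class Field(dict):
    """Stand-in for scrapy.Field: a plain dict holding per-field metadata."""
    pass

class Item(dict):
    """Dict that rejects keys outside the declared field names."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (self.__class__.__name__, key))
        dict.__setitem__(self, key, value)

class QianchengwuyouItem(Item):
    fields = ('job_title', 'company_name', 'company_address',
              'salary', 'release_time')

item = QianchengwuyouItem()
item['job_title'] = 'Python developer'   # declared field: accepted
print(item['job_title'])

try:
    item['unknown'] = 'x'                # undeclared field: rejected
except KeyError:
    print('unknown field rejected')
```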

Write the main spider in qcwy.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qianchengwuyou.items import QianchengwuyouItem

class QcwySpider(CrawlSpider):
    name = 'qcwy'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?']
    # https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,7.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=

    rules = (
        # (\d+) matches the page number, so the rule follows every pagination link
        Rule(LinkExtractor(allow=r'https://search\.51job\.com/list/000000,000000,0000,00,9,99,python,2,(\d+)\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # position()>1 skips the header row of the result list
        list_job = response.xpath('//div[@id="resultList"]/div[@class="el"][position()>1]')
        for job in list_job:
            item = QianchengwuyouItem()
            item['job_title'] = job.xpath('./p/span/a/@title').extract_first()
            item['company_name'] = job.xpath('./span[1]/a/@title').extract_first()
            item['company_address'] = job.xpath('./span[2]/text()').extract_first()
            item['salary'] = job.xpath('./span[3]/text()').extract_first()
            item['release_time'] = job.xpath('./span[4]/text()').extract_first()
            yield item
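
The regex in the Rule is what drives pagination. A quick stdlib check that the pattern (with the dot before "html" escaped so it matches a literal ".") captures the page number from the listing URLs used above:

```python
import re

# Pagination pattern from the LinkExtractor rule, dot escaped.
pattern = r'https://search\.51job\.com/list/000000,000000,0000,00,9,99,python,2,(\d+)\.html'

urls = [
    'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html?',
    'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,7.html?lang=c&postchannel=0000',
]
pages = [re.search(pattern, u).group(1) for u in urls]
print(pages)  # ['1', '7']
```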

Write the storage pipeline in pipelines.py

import pymysql

class QianchengwuyouPipeline(object):
    conn = None
    mycursor = None

    def open_spider(self, spider):
        print('Connecting to the database...')
        self.conn = pymysql.connect(host='172.16.25.4', user='root', password='root', db='scrapy')
        self.mycursor = self.conn.cursor()

    def process_item(self, item, spider):
        print('Writing to the database...')
        # parameterized query: the driver escapes values, unlike raw string formatting
        sql = 'INSERT INTO qcwy VALUES (NULL, %s, %s, %s, %s, %s)'
        self.mycursor.execute(sql, (item['job_title'], item['company_name'],
                                    item['company_address'], item['salary'],
                                    item['release_time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        print('Finished writing to the database...')
        self.mycursor.close()
        self.conn.close()
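
The pipeline assumes a qcwy table already exists in the scrapy database. A possible DDL, inferred from the insert statement (column names and types are assumptions; the leading auto-increment id matches the NULL in the first position):

```sql
CREATE TABLE qcwy (
    id INT AUTO_INCREMENT PRIMARY KEY,
    job_title VARCHAR(255),
    company_name VARCHAR(255),
    company_address VARCHAR(255),
    salary VARCHAR(64),
    release_time VARCHAR(32)
) DEFAULT CHARSET = utf8mb4;
```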

Enable the item pipeline and set the request header in settings.py

ITEM_PIPELINES = {
    'qianchengwuyou.pipelines.QianchengwuyouPipeline': 300,
}
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'

Run the spider, exporting to a .json file at the same time

scrapy crawl qcwy -o qcwy.json --nolog
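
The -o flag serializes every yielded item into qcwy.json as a JSON array. A stdlib sketch of reading it back; the sample string here is a hypothetical one-item file for illustration, the real file holds one object per scraped posting:

```python
import json

# Hypothetical single-item sample of what qcwy.json could contain.
sample = ('[{"job_title": "Python developer", "company_name": "Example Co", '
          '"company_address": "Shanghai", "salary": "1-1.5万/月", '
          '"release_time": "06-01"}]')

items = json.loads(sample)
print(len(items), items[0]['job_title'])
```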

Check the database to confirm the data was written successfully.

done.
