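Scraping Lianjia second-hand housing listings for Weifang with a Python crawler

Requirements analysis

Scrape the second-hand housing listings shown on the pages for each county, city, and district of Weifang, extract the required fields, and save them locally.

Workflow design

  • Identify the target site URL (https://wf.lianjia.com/)
  • Decide which fields to scrape for each listing
  • Implement the crawler in Python with the requests and lxml libraries
  • Store the scraped data in a CSV file or a database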
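
The listing pages requested later in this post follow the pattern https://wf.lianjia.com/ershoufang/<district>/pg<page>. A minimal sketch of building such URLs is shown here; the helper name listing_url is hypothetical and not part of the original project.

```python
BASE_URL = 'https://wf.lianjia.com/ershoufang/'


def listing_url(district, page):
    """Return the URL of one results page for a district slug (hypothetical helper)."""
    return BASE_URL + district + '/pg' + str(page)


# First three results pages for Kuiwen District
urls = [listing_url('kuiwenqu', page) for page in range(1, 4)]
print(urls)
```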

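Implementation

Project directory

1. Create the table in the database

I use MySQL 8.0 on my machine, with Navicat as the graphical client.

Database field mapping

id - record number, title - listing title, total_price - total price, unit_price - unit price,

square - floor area, size - layout, floor - floor, direction - orientation, type - building type,

district - district, nearby - neighbouring area, community - residential community, elevator - elevator available or not,

elevatorNum - elevator-to-household ratio, ownership - property ownership type

The storage code in the next step also writes two more columns, BuildTime and decoration, so the table needs them as well.

The figure shows the field names, data types, lengths, and related settings.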
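
Since the figure with the actual column definitions is not reproduced here, the following is only a sketch of how a matching table could be created. The column names come from the field list and from the insert statement in the storage code; the table name wf also comes from that statement. The VARCHAR(255) type for every column and the helper name build_create_table are assumptions.

```python
# Column names follow the field list plus BuildTime and decoration from the storage code.
# The VARCHAR(255) type is an assumption, not the author's actual schema.
COLUMNS = ['title', 'total_price', 'unit_price', 'square', 'size', 'floor', 'direction', 'type',
           'BuildTime', 'district', 'nearby', 'community', 'decoration', 'elevator',
           'elevatorNum', 'ownership']


def build_create_table(table='wf'):
    """Build a CREATE TABLE statement with an auto-increment id (illustrative only)."""
    lines = ['id INT AUTO_INCREMENT PRIMARY KEY']
    # Backticks quote the identifiers so names like `type` cannot clash with SQL keywords
    lines += ['`{}` VARCHAR(255)'.format(name) for name in COLUMNS]
    return 'CREATE TABLE IF NOT EXISTS `{}` (\n    {}\n) DEFAULT CHARSET=utf8mb4;'.format(
        table, ',\n    '.join(lines))


print(build_create_table())
```

The printed statement can be pasted into Navicat or run through a database cursor; numeric column types would suit the price and area fields better once their scraped format is known.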

2. Custom data storage functions

This code goes in the file Spider_wf.py.

The write_csv function saves the data to a CSV file, and the write_db function saves it to the database.
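
Because write_csv opens the file in append mode and never writes a header, the resulting CSV has no column names. One possible way to write the header exactly once is sketched here; the function name append_row, the shortened field list, and the sample records are illustrative assumptions.

```python
import csv
import os
import tempfile

FIELDNAMES = ['title', 'total_price', 'district']  # shortened list for illustration


def append_row(path, row):
    """Append one record, writing the header first if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode='a', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        if need_header:
            writer.writeheader()
        writer.writerow(row)


# Usage with made-up sample records in a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
append_row(demo_path, {'title': 'sample A', 'total_price': '100', 'district': 'kuiwenqu'})
append_row(demo_path, {'title': 'sample B', 'total_price': '80', 'district': 'fangziqu'})
with open(demo_path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```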


import csv
import pymysql

# Column order shared by the CSV file and the MySQL table
FIELDNAMES = ['title', 'total_price', 'unit_price', 'square', 'size', 'floor', 'direction', 'type',
              'BuildTime', 'district', 'nearby', 'community', 'decoration', 'elevator',
              'elevatorNum', 'ownership']


# Append one record to the CSV file
def write_csv(example_1):
    # The with block closes the file after each write
    with open('二手房数据.csv', mode='a', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writerow(example_1)


# Insert one record into the MySQL table wf
def write_db(example_2):
    conn = pymysql.connect(host='127.0.0.1', port=3306, user='changziru',
                           password='ru123321', database='secondhouse_wf', charset='utf8mb4')
    try:
        cursor = conn.cursor()
        # Missing keys default to '' (total_price defaults to '0')
        values = [example_2.get(name, '0' if name == 'total_price' else '') for name in FIELDNAMES]
        cursor.execute(
            'insert into wf (' + ', '.join(FIELDNAMES) + ') '
            'values (' + ', '.join(['%s'] * len(FIELDNAMES)) + ')',
            values)
        conn.commit()  # commit the insert
    finally:
        conn.close()  # always close the connection

3. Crawler implementation

This code goes in the file lianjia_house.py and calls the write_csv and write_db functions from Spider_wf.py in the project.
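
The crawler reads every field the same way: run a query, take the first match, and fall back to None when nothing matches. The sketch below isolates that pattern. To stay runnable without third-party packages it uses the standard library's xml.etree.ElementTree on a made-up, well-formed fragment as a stand-in for lxml; the helper name first_text and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment; real pages need lxml's forgiving HTML parser
SAMPLE = '''
<html><body>
  <div class="title"><h1>Sample listing</h1></div>
  <span class="total">100</span>
</body></html>
'''


def first_text(root, path):
    """Return the text of the first element matching path, or None if nothing matches."""
    matches = root.findall(path)
    return matches[0].text if matches else None


root = ET.fromstring(SAMPLE)
item = {
    'title': first_text(root, './/div[@class="title"]/h1'),
    'total_price': first_text(root, './/span[@class="total"]'),
    'unit_price': first_text(root, './/span[@class="unitPriceValue"]'),  # absent in SAMPLE
}
print(item)
```

A missing element yields None instead of raising IndexError, so one incomplete listing does not stop the whole crawl.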

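# Scrape the detail pages of Lianjia second-hand housing listings
import time
from random import randint
import requests
from lxml import etree
from secondhouse_spider.Spider_wf import write_csv, write_db

# Browser User-Agent strings used to imitate a real browser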
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
# Pick a random User-Agent
random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
# Send the random User-Agent with every request and close the connection afterwards
headers = {'User-Agent': random_agent, 'Connection': 'close'}

# XPath expression for each field on a listing's detail page
FIELD_XPATHS = {
    'title': '//div[@class="title"]/h1/text()',  # 1 title
    'total_price': '//span[@class="total"]/text()',  # 2 total price
    'unit_price': '//span[@class="unitPriceValue"]/text()',  # 3 unit price
    'square': '//div[@class="area"]/div[@class="mainInfo"]/text()',  # 4 floor area
    'size': '//div[@class="base"]/div[@class="content"]/ul/li[1]/text()',  # 5 layout
    'floor': '//div[@class="base"]/div[@class="content"]/ul/li[2]/text()',  # 6 floor
    'direction': '//div[@class="type"]/div[@class="mainInfo"]/text()',  # 7 orientation
    'type': '//div[@class="area"]/div[@class="subInfo"]/text()',  # 8 building type
    'BuildTime': '//div[@class="transaction"]/div[@class="content"]/ul/li[5]/span[2]/text()',  # 9 property age (房屋年限)
    'district': '//div[@class="areaName"]/span[@class="info"]/a[1]/text()',  # 10 district
    'nearby': '//div[@class="areaName"]/span[@class="info"]/a[2]/text()',  # 11 area
    'community': '//div[@class="communityName"]/a[1]/text()',  # 12 residential community
    'decoration': '//div[@class="base"]/div[@class="content"]/ul/li[9]/text()',  # 13 decoration
    'elevator': '//div[@class="base"]/div[@class="content"]/ul/li[11]/text()',  # 14 elevator
    'elevatorNum': '//div[@class="base"]/div[@class="content"]/ul/li[10]/text()',  # 15 elevator-to-household ratio
    'ownership': '//div[@class="transaction"]/div[@class="content"]/ul/li[2]/span[2]/text()',  # 16 transaction ownership
}


class SpiderFunc:
    def __init__(self):
        self.count = 0

    def spider(self, url_list):
        for sh in url_list:
            response = requests.get(url=sh, params={'param': '1'}, headers=headers).text
            tree = etree.HTML(response)
            li_list = tree.xpath('//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
            for li in li_list:
                # Get the detail-page URL of each listing
                href_list = li.xpath('.//div[@class="title"]/a/@href')
                if not href_list:
                    continue
                detail_url = href_list[0]
                try:
                    # Request the detail page
                    detail_response = requests.get(url=detail_url, headers=headers).text
                except Exception as e:
                    sleeptime = randint(15, 30)
                    time.sleep(sleeptime)  # wait a random 15 to 30 seconds
                    print(repr(e))  # print the exception
                    continue
                detail_tree = etree.HTML(detail_response)
                item = {}
                for field, xpath in FIELD_XPATHS.items():
                    # Take the first match, or None if the element is missing
                    result = detail_tree.xpath(xpath)
                    item[field] = result[0] if result else None
                self.count += 1
                print(self.count, item['title'])
                # Save the record to the CSV file
                write_csv(item)
                # Save the record to the MySQL database
                write_db(item)

# Last results page to request for each district
DISTRICT_PAGES = {
    'qingzhoushi': 40,                # Qingzhou
    'hantingqu': 76,                  # Hanting District
    'jingjijishukaifaqu2': 85,        # Economic and Technological Development Zone
    'shouguangshi': 95,               # Shouguang
    'weichengqu': 100,                # Weicheng District
    'fangziqu': 100,                  # Fangzi District
    'kuiwenqu': 100,                  # Kuiwen District
    'gaoxinjishuchanyekaifaqu': 100,  # High-tech Zone
}

# Loop over the target pages; one SpiderFunc instance keeps a running count
spider_func = SpiderFunc()
for page in range(1, 101):
    # Only request districts that still have a page with this number
    list_wf = ['https://wf.lianjia.com/ershoufang/' + district + '/pg' + str(page)
               for district, last_page in DISTRICT_PAGES.items() if page <= last_page]
    spider_func.spider(list_wf)
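
When a detail request fails, the crawler above sleeps for 15 to 30 seconds and then skips that listing. A possible variation is to retry a few times before giving up. The sketch below shows that idea under stated assumptions: the names fetch_with_retry, attempts, and delay_range are hypothetical, and the fetch callable is passed in so the example runs without network access.

```python
import random
import time


def fetch_with_retry(fetch, url, attempts=3, delay_range=(15, 30), sleep=time.sleep):
    """Call fetch(url); on an exception wait a random delay and retry.

    Returns the result, or None once all attempts have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:
            print('attempt', attempt, 'failed:', repr(e))
            if attempt < attempts:
                sleep(random.randint(*delay_range))
    return None


# Demonstration with a fake fetch that fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated failure')
    return '<html>ok</html>'


waits = []  # record the delays instead of actually sleeping
result = fetch_with_retry(flaky_fetch, 'https://wf.lianjia.com/ershoufang/', sleep=waits.append)
print(result, waits)
```

In the crawler, fetch could be a small function that wraps requests.get(url, headers=headers).text.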
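
4. Results

A total of 20,826 records were collected.

Because I need the database for data analysis, I preprocessed the data, which left 18,031 records.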
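
The post does not say which preprocessing steps were applied. As one possibility, the sketch below drops records with missing key fields and exact duplicates. The function name clean_records, the choice of required fields, and the sample data are all assumptions.

```python
def clean_records(records, required=('title', 'total_price', 'square')):
    """Drop records with missing required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        if any(not record.get(field) for field in required):
            continue  # skip incomplete records
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # skip duplicates
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Made-up sample records
samples = [
    {'title': 'sample A', 'total_price': '100', 'square': '90'},
    {'title': 'sample A', 'total_price': '100', 'square': '90'},  # duplicate
    {'title': 'sample B', 'total_price': None, 'square': '75'},   # missing price
    {'title': 'sample C', 'total_price': '80', 'square': '60'},
]
cleaned = clean_records(samples)
print(len(cleaned))
```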




