Writing a robots.txt check plugin
2024-10-09 17:12:42
First, split the URL:
url = 'https://www.baidu.com/s?wd=123&rsv_spt=1&rsv_iqid=0x8d22781d000014ad&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&rsv_btype=i&inputT=875&rsv_sug4=875'
ends = "robots.txt"
url = url.split('/')
print(url)
Output:
['https:', '', 'www.baidu.com', 's?wd=123&rsv_spt=1&rsv_iqid=0x8d22781d000014ad&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&rsv_btype=i&inputT=875&rsv_sug4=875']
Next, clean the result: keep the first three parts, join them back with '/', and use urljoin to attach `ends`:
from urllib.parse import urljoin
url = 'https://www.baidu.com/s?wd=123&rsv_spt=1&rsv_iqid=0x8d22781d000014ad&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&rsv_btype=i&inputT=875&rsv_sug4=875'
ends = "robots.txt"
url = url.split('/')
url = '/'.join(url[:3])
url = urljoin(url,ends)
print(url)
Output:
https://www.baidu.com/robots.txt
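As a side note, the same site root can be obtained without manual splitting; this is a sketch using urlsplit's scheme and netloc fields, which makes the slicing step unnecessary:

```python
from urllib.parse import urlsplit, urljoin

# urlsplit separates the URL into scheme, netloc, path, query, fragment;
# scheme + netloc is exactly the site root the split('/')[:3] trick produces.
url = 'https://www.baidu.com/s?wd=123&rsv_spt=1'
parts = urlsplit(url)
root = f'{parts.scheme}://{parts.netloc}/'
robots_url = urljoin(root, 'robots.txt')
print(robots_url)  # https://www.baidu.com/robots.txt
```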
Now that we have the target URL, fetch the robots.txt content. Import requests and print it:
import requests
html = requests.get(url)
print(html.text)
The printed result is:
G:\python3.8\python.exe "F:/python post/code/RobotsTest.py"
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
User-agent: *
Disallow: /
Process finished with exit code 0
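Before writing a parser by hand, it is worth knowing that the standard library already ships one. This is a sketch using urllib.robotparser against a few rules like those printed above (the rule text here is a trimmed-down assumption, not the full Baidu file):

```python
from urllib.robotparser import RobotFileParser

# A minimal excerpt of rules in the same shape as the fetched file.
rules = """User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse() accepts an iterable of lines

# can_fetch(agent, url) answers the question our plugin is built for.
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/s?wd=123'))      # False
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/index.html'))    # True
```

In a real plugin you would call `rp.set_url(...)` and `rp.read()` to fetch the file over the network instead of feeding the text in directly.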
Now write the content to robots.txt, then read it back and check whether our User-Agent appears in the file's disallow lists.
Let's assume the agent is Googlebot:
headers = {'user-agent': 'Googlebot'}
with open('robots.txt', 'w', encoding='utf-8') as f:
    f.write(html.text)
with open('robots.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()
for line in lines:
    line = line.strip()  # strip() already removes the trailing '\n', so no extra replace() is needed
With each line stripped, check whether the agent name appears in it; once it does, collect the following Disallow values into a new list, using a flag to mark that we are inside the matching block:
lines = f.readlines()
domain = []
flag = False
for line in lines:
    line = line.strip()
    if headers['user-agent'] in line:
        flag = True
        continue
    elif line.startswith('Disallow'):
        if flag is True:
            domain.append(line.replace('Disallow: ', ''))
    elif line == '':
        # a blank line ends the matched User-agent block
        if flag is True:
            break
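The same scan can be packaged as a small function. This is a sketch (the function name and sample text are mine) that resets the flag on every `User-agent:` header instead of breaking at the first blank line, which is slightly more robust when blocks are not separated by blank lines:

```python
def disallowed_paths(robots_text, agent):
    """Collect the Disallow paths from the block matching `agent`."""
    paths = []
    flag = False
    for line in robots_text.splitlines():
        line = line.strip()
        if line.lower().startswith('user-agent'):
            # Entering a new block: the flag is on only if it names our agent.
            flag = agent in line
        elif flag and line.startswith('Disallow:'):
            # Split on the first ':' so the path itself may contain ':'.
            paths.append(line.split(':', 1)[1].strip())
    return paths

sample = """User-agent: Googlebot
Disallow: /baidu
Disallow: /s?

User-agent: *
Disallow: /
"""
print(disallowed_paths(sample, 'Googlebot'))  # ['/baidu', '/s?']
```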
Finally, wrap everything into a class. The complete code:
from urllib.parse import urljoin
import requests

# url = 'https://www.baidu.com/s?wd=123&rsv_spt=1&rsv_iqid=0x8d22781d000014ad&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&rsv_btype=i&inputT=875&rsv_sug4=875'
# ends = "robots.txt"

class Robots:
    def __init__(self, url, Agent):
        self.Baseurl = url                      # the original URL we want to check
        self.url = url
        self.headers = {'user-agent': Agent}
        self.ends = 'robots.txt'
        self.Dourl()

    def Dourl(self):
        # Reduce the URL to its site root, then append robots.txt
        url = self.url.split('/')
        url = '/'.join(url[:3])
        url = urljoin(url, self.ends)
        self.url = url

    def getRobots(self):
        html = requests.get(self.url)
        with open('robots.txt', 'w', encoding='utf-8') as f:
            f.write(html.text)
        with open('robots.txt', 'r', encoding='utf-8') as f:
            lines = f.readlines()
        domain = []
        flag = False
        for line in lines:
            line = line.strip()
            if self.headers['user-agent'] in line:
                flag = True
                continue
            elif line.startswith('Disallow'):
                if flag is True:
                    domain.append(line.replace('Disallow: ', ''))
            elif line == '':
                # a blank line ends the matched User-agent block
                if flag is True:
                    break
        for d in domain:
            if d in self.Baseurl:
                print("The site disallows crawling this URL")
                return False
        return True

if __name__ == '__main__':
    url = input('url is >>')
    agent = input('agent is >>')
    # url = 'https://www.baidu.com/s?wd=123&rsv_spt=1&rsv_iqid=0xc6f64e0200000143&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=1&rsv_sug7=100&rsv_btype=i&inputT=993&rsv_sug4=993'
    # agent = 'Googlebot'
    r = Robots(url, agent)
    print(r.getRobots())
F:\python post\code>python RobotsTest.py
url is >>https://www.baidu.com/s?wd=123&rsv_spt=1&rsv_iqid=0x8d22781d000014ad&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_dl=tb&rsv_enter=0&rsv_sug3=3&rsv_sug1=2&rsv_sug7=100&rsv_btype=i&inputT=875&rsv_sug4=875
agent is >>Googlebot
The site disallows crawling this URL
False
F:\python post\code>