python3: Collecting every domain name linked from gambling-site pages (batch)
2024-09-02 08:51:50
Existing domain information
The detailed implementation is as follows
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup as Bs4
from urllib.parse import urlparse

# Browser-like User-Agent sent with every request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
# Read the domain list from 1.txt, one domain per line
def new_url():
    url_list = []
    with open("1.txt", "r") as bo:
        for i in bo:
            i = i.strip()
            if i:  # skip blank lines
                url_list.append(i)
    return url_list
# Fetch each site and collect the domains its links point to
def get_url():
    head_url = new_url()
    for i in head_url:  # iterate over the input domains line by line
        print("***********************************" + i + "***********************************")
        try:
            response = requests.get(url="http://" + i, headers=headers, timeout=10)
            response.encoding = 'gb2312'  # assumes the target pages are GB2312-encoded
            soup = Bs4(response.text, "lxml")
            htmls = soup.find_all("a", href=True)  # every <a> tag that carries an href
            urls = []
            new_urls = []
            for html in htmls:
                url = html.get("href")
                urls.append(url.replace('\n', ''))
            qc_urls = set(urls)
            for url in qc_urls:  # keep absolute links and extract the domain
                if url.startswith("http"):
                    res = urlparse(url)
                    domain = res.netloc
                    if domain:
                        new_urls.append(domain)
            qc_new_urls = set(new_urls)  # de-duplicate within this page
            print(qc_new_urls)
            with open("url_v1.txt", "a+", encoding="utf-8") as f:
                for j in qc_new_urls:
                    f.write(j + "\n")
        except Exception as e:
            print("Link unreachable:", e)
    result_list = []
    with open("./url_v1.txt", "r", encoding="utf-8") as result:
        for r in result:
            result_list.append(r.replace("\n", ""))
    # Second pass: drop duplicates across all sites.
    # Opened with "w" so that reruns do not append duplicates to the final file.
    with open("url_end_V.txt", "w", encoding="utf-8") as f:
        for x in set(result_list):
            print(x)
            f.write(x + "\n")
if __name__ == "__main__":
    get_url()
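
The core step of the script, turning a page's links into a de-duplicated set of domains, can be tried offline without fetching anything. The following is only a minimal sketch under a few assumptions: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages, and the names `LinkCollector`, `extract_domains`, and `sample` are hypothetical helpers that do not appear in the original script.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for a bare attribute, so test it first
                if name == "href" and value:
                    self.hrefs.append(value.strip())


def extract_domains(html):
    """Return the set of hosts found in absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for href in collector.hrefs:
        parts = urlparse(href)
        # Relative links have no scheme/netloc and are skipped
        if parts.scheme in ("http", "https") and parts.netloc:
            domains.add(parts.netloc.lower())
    return domains


# Made-up sample markup, used only for illustration
sample = """
<a href="http://example.com/a">one</a>
<a href="https://example.com/b">two</a>
<a href="/relative/path">three</a>
<a>no href</a>
<a href="https://Sub.Example.org:8080/x">four</a>
"""
print(sorted(extract_domains(sample)))
```

Note that `netloc` keeps the port when one is present, so `sub.example.org:8080` is treated as distinct from `sub.example.org`. Whether that is desirable depends on how the resulting list is used.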