Batch-fetching page titles
2024-10-21 03:27:47
import concurrent.futures

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Read the URLs from the .txt file, one per line, skipping blank lines
with open("urls.txt", "r") as file:
    urls = [line.strip() for line in file if line.strip()]

# Collected (URL, title) pairs
data = []


def fetch_title(url):
    """Return (url, title); the title is empty if the request fails or the page has no <title>."""
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
    except requests.RequestException:
        title = ""
    return (url, title)


# Fetch the pages concurrently; results arrive in completion order, not input order
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(fetch_title, url) for url in urls]

    for future in concurrent.futures.as_completed(futures):
        data.append(future.result())

# Write the URLs and titles to an Excel file
df = pd.DataFrame(data, columns=["URL", "Title"])

# to_excel opens, writes, and closes the workbook itself; openpyxl must be installed
df.to_excel("titles.xlsx", index=False, engine="openpyxl")
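Because all results are held in memory and written to the Excel file in one go at the end, avoid fetching too many URLs in a single run.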
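One way to relax that limit is to process the URL list in fixed-size batches and append each batch to disk as soon as it finishes. The sketch below is only an illustration and is not part of the original script. It writes CSV rather than Excel, on the assumption that a plain-text file is easier to append to. The names `chunked`, `append_rows`, `process_in_batches`, and `titles.csv` are hypothetical.

```python
import csv
import os


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def append_rows(path, rows, header=("URL", "Title")):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


def process_in_batches(urls, fetch_batch, path="titles.csv", batch_size=100):
    """Fetch titles batch by batch, saving each batch before starting the next.

    `fetch_batch` is any callable mapping a list of URLs to a list of
    (url, title) pairs; it is a placeholder for the real fetching logic.
    """
    for batch in chunked(urls, batch_size):
        append_rows(path, fetch_batch(batch))
```

Here `fetch_batch` could wrap the thread-pool loop from the script above, so that at most `batch_size` results are held in memory at a time and an interrupted run keeps the batches already written.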