Python: converting www.thisamericanlife.org to PDF
2024-09-06 22:10:09
Environment setup
pip install requests
pip install beautifulsoup4
pip install pdfkit
$ sudo apt-get install wkhtmltopdf # ubuntu
$ sudo yum install wkhtmltopdf # centos
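pdfkit is only a wrapper and needs the wkhtmltopdf binary installed by the commands above. A minimal sketch for checking that the binary is reachable before running the scripts; the helper name find_wkhtmltopdf is made up for this illustration and is not part of pdfkit:

```python
import shutil
from typing import Optional


def find_wkhtmltopdf() -> Optional[str]:
    """Return the full path of the wkhtmltopdf executable, or None if it is not on PATH.

    Hypothetical helper for illustration; it only wraps shutil.which.
    """
    return shutil.which("wkhtmltopdf")


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found; install it with apt-get or yum first")
    else:
        print("wkhtmltopdf found at", path)
```

If the binary lives outside PATH, pdfkit.configuration(wkhtmltopdf=...) can be given the explicit location and passed to from_string through its configuration argument.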
Script
#!/usr/bin/env python3.5
# -*- coding: utf-8 -*-
# @Time : 2019/11/18 10:48 PM
# @Author : yon
# @Email : xxx@qq.com
# @File : day1.py.py
import pdfkit
import requests
from bs4 import BeautifulSoup
headers = {
    # 'Accept': 'application/json, text/javascript, */*; q=0.01',
    # 'Accept': '*/*',
    # 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7',
    # 'Cache-Control': 'no-cache',
    # 'accept-encoding': 'gzip, deflate, br',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
    'Referer': 'https://www.google.com/'
}
options = {
    'page-size': 'Letter',
    'encoding': "UTF-8",
    'custom-header': [
        ('Accept-Encoding', 'gzip')
    ]
}
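# Download the transcript page for episode 687.
resp = requests.get('https://www.thisamericanlife.org/687/transcript', headers=headers)
resp.raise_for_status()  # stop early on an HTTP error instead of converting an error page
soup = BeautifulSoup(resp.content, "html.parser")
# Keep only the <article> element, which holds the transcript text.
body = soup.find("article")
all1 = str(body)
# Pass the options defined above; the original call left them unused.
pdfkit.from_string(all1, "/home/yon/Desktop/tt.pdf", options=options)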
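The script hands pdfkit a bare article fragment with no head element or charset declaration. A small sketch of wrapping the fragment in a complete HTML document first, on the assumption that an explicit UTF-8 declaration helps wkhtmltopdf render non-ASCII characters correctly. The name wrap_fragment and the default title are invented for this example:

```python
import html


def wrap_fragment(fragment: str, title: str = "Transcript") -> str:
    """Wrap an HTML fragment in a minimal document that declares UTF-8.

    Hypothetical helper: the fragment is inserted unchanged, only the
    title is escaped.
    """
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{fragment}\n"
        "</body>\n</html>\n"
    )


if __name__ == "__main__":
    # Example fragment; in the script above this would be str(body).
    page = wrap_fragment("<article><p>Caf\u00e9</p></article>", title="Episode 687")
    print(page)
```

The returned string can then be given to pdfkit.from_string in place of the raw fragment.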
An alternative approach
import requests  # only needed for the commented-out download step
import pdfkit
from bs4 import BeautifulSoup
# headers = {
#     # 'Accept': 'application/json, text/javascript, */*; q=0.01',
#     'Accept': '*/*',
#     'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7',
#     'Cache-Control': 'no-cache',
#     'accept-encoding': 'gzip, deflate, br',
#     'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
#     'Referer': 'https://www.google.com/'
# }
#
# resp = requests.get('https://www.thisamericanlife.org/687/transcript', headers=headers)
#
# html = resp.content
# with open("thisaericanlife.html", 'wb') as f:
#     f.write(html)
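# Parse the previously saved copy of the transcript page
# (assumes the saved file is UTF-8 encoded).
with open("thisaericanlife.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup.article.contents)
print("type")
# Concatenate the children of <article> into one HTML string.
html = ""
for x in soup.article.contents:
    # print(str(x))
    html += str(x)
print(html)
# html = BeautifulSoup(soup.article.contents)
# print(type(html))
# print(html)
pdfkit.from_string(html, "/home/baixiaoxu/desk/tt.pdf")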
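Both scripts hardcode episode 687 and an absolute output path. A rough sketch of deriving the URL and the PDF file name from an episode number instead. It assumes other episodes follow the same /<number>/transcript URL pattern, which the post only shows for 687; the helper names, the file naming scheme, and episode 688 are made up for illustration:

```python
from pathlib import Path

BASE_URL = "https://www.thisamericanlife.org"


def transcript_url(episode: int) -> str:
    """Build the transcript URL, assuming the pattern seen for episode 687."""
    return f"{BASE_URL}/{episode}/transcript"


def pdf_path(episode: int, out_dir: str = ".") -> Path:
    """Build an output path such as out_dir/thisamericanlife-687.pdf (hypothetical naming)."""
    return Path(out_dir) / f"thisamericanlife-{episode}.pdf"


if __name__ == "__main__":
    # 688 is only an example number; it is not taken from the post.
    for episode in (687, 688):
        print(transcript_url(episode), "->", pdf_path(episode, "pdf"))
```

The two values can replace the literal URL in requests.get and the literal path in pdfkit.from_string.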