Environment setup

pip install requests
pip install beautifulsoup4
pip install pdfkit
$ sudo apt-get install wkhtmltopdf # ubuntu
$ sudo yum install wkhtmltopdf # centos
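
pdfkit is a wrapper that calls the wkhtmltopdf binary, so it can help to confirm that the binary is on PATH before running the script. The following is a minimal sketch using only the standard library; the helper name `find_wkhtmltopdf` is made up for this example.

```python
import shutil


def find_wkhtmltopdf(name="wkhtmltopdf"):
    """Return the full path of the wkhtmltopdf executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf not found; install it with apt-get or yum first")
    else:
        print("wkhtmltopdf found at", path)
```

If the binary lives in a non-standard location, its path can be handed to pdfkit through `pdfkit.configuration(wkhtmltopdf=path)` and the `configuration=` argument of `pdfkit.from_string`.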

Script

#!/usr/bin/env python3.5
# -*- coding: utf-8 -*-
# @Time : 2019/11/18 10:48 PM
# @Author : yon
# @Email : xxx@qq.com
# @File : day1.py.py

import os
import re
import time
import logging
import pdfkit
from bs4 import BeautifulSoup
import requests

# Request headers that make the request look like it comes from a browser
headers = {
    # 'Accept': 'application/json, text/javascript, */*; q=0.01',
    # 'Accept': '*/*',
    # 'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7',
    # 'Cache-Control': 'no-cache',
    # 'accept-encoding': 'gzip, deflate, br',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
    'Referer': 'https://www.google.com/'
}

# Options forwarded to wkhtmltopdf
options = {
    'page-size': 'Letter',
    'encoding': "UTF-8",
    'custom-header': [
        ('Accept-Encoding', 'gzip')
    ]
}

# Download the transcript page and keep only the <article> element
resp = requests.get('https://www.thisamericanlife.org/687/transcript', headers=headers)
soup = BeautifulSoup(resp.content, "html.parser")
body = soup.find("article")
all1 = str(body)

# The original defined options but never used it; it is passed here on the assumption that this was the intent
pdfkit.from_string(all1, "/home/yon/Desktop/tt.pdf", options=options)
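
The script above passes only the `<article>` fragment to `pdfkit.from_string`, so the HTML it hands over has no `<head>` and no charset declaration. If non-ASCII characters come out garbled in the PDF, one thing that may help is wrapping the fragment in a small complete document first. The sketch below shows one way to do this; the helper `wrap_fragment` and its `title` parameter are assumptions made for the example and are not part of pdfkit.

```python
import html


def wrap_fragment(fragment, title="transcript"):
    """Wrap an HTML fragment in a minimal full document that declares UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        "<title>" + html.escape(title) + "</title>\n"
        "</head>\n<body>\n"
        + fragment +
        "\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    page = wrap_fragment("<article><p>café</p></article>", title="Episode 687")
    print(page)
    # The result could then be passed on, e.g. pdfkit.from_string(page, "tt.pdf")
```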

An alternative approach

import os
import re
import time
import logging
import requests
import urllib.request
import stat
import pdfkit
from bs4 import BeautifulSoup

# The commented-out block below downloads the page once and saves it locally
# headers = {
#     # 'Accept': 'application/json, text/javascript, */*; q=0.01',
#     'Accept': '*/*',
#     'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-US;q=0.7',
#     'Cache-Control': 'no-cache',
#     'accept-encoding': 'gzip, deflate, br',
#     'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36',
#     'Referer': 'https://www.google.com/'
# }
#
#
# resp = requests.get('https://www.thisamericanlife.org/687/transcript', headers=headers)
#
# html = resp.content
# with open("thisaericanlife.html", 'wb') as f:
#     f.write(html)

# Parse the saved copy of the page
with open("thisaericanlife.html") as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup.article.contents)
print("type")

# Concatenate the child nodes of <article> into a single HTML string
html = ""
for x in soup.article.contents:
    # print(str(x))
    html += str(x)
print(html)

# html = BeautifulSoup(soup.article.contents)
# print(type(html))
# print(html)
pdfkit.from_string(html, "/home/baixiaoxu/desk/tt.pdf")
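
The loop above builds the inner HTML of `<article>` by joining its children one by one. BeautifulSoup also offers `Tag.decode_contents()`, which renders the children of a tag without the enclosing tag itself. The sketch below compares the two on a small made-up sample (the `sample` string is an assumption for illustration, not the real transcript page).

```python
from bs4 import BeautifulSoup

sample = "<html><body><article><h1>Title</h1><p>Hello <b>world</b></p></article></body></html>"
soup = BeautifulSoup(sample, "html.parser")

# Same approach as the loop above: concatenate the string form of each child node.
joined = ""
for child in soup.article.contents:
    joined += str(child)

# decode_contents() renders the children without the enclosing <article> tag.
inner = soup.article.decode_contents()

print(joined)
print(inner == joined)
```

On this sample the two strings match. They are not guaranteed to match on every page: text that sits directly inside `<article>` and contains characters such as `&` may be escaped by `decode_contents()` but not by `str()` on the text node.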
