Fetching Zhihu Daily with Python and saving it as txt files

Preface

This is a practice project. It is fairly simple (and has a bug). Feedback is welcome~

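Features

Fetches the current day's Zhihu Daily content and saves each post as a separate txt file, with all the files collected in one folder named after the current date.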
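The date-named folder can be sketched roughly like this. This is only an illustration written for Python 3, not the script's actual code: the helper names `make_daily_folder` and `save_post`, the ISO date format for the folder name, and the UTF-8 file encoding are all assumptions.

```python
import datetime
import os


def make_daily_folder(base_dir=".", today=None):
    """Create (if needed) and return a folder named after the date."""
    if today is None:
        today = datetime.date.today()
    # Assumption: ISO format (YYYY-MM-DD) is used for the folder name.
    folder = os.path.join(base_dir, today.isoformat())
    if not os.path.isdir(folder):
        os.makedirs(folder)
    return folder


def save_post(folder, title, content):
    """Write one post to <folder>/<title>.txt and return the file path."""
    # A "/" in the title would be read as a path separator, so replace it.
    safe_title = title.replace("/", "_")
    path = os.path.join(folder, safe_title + ".txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(content)
    return path
```

With helpers like these, each post would be saved with `save_post(make_daily_folder(), title, content)`.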

Libraries used

re, BeautifulSoup, sys, urllib2

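Notes

1. The runtime environment is Linux with Python 2.7.x. To use it on Windows, just change the commands inside the script.

2. The bug: when handling the "如何正确吐槽" (roughly, "How to Roast Properly") posts, only the first one is fetched (laziness kicked in).

3. Fetching the content directly (as below) does not work, because Zhihu has anti-scraping measures in place

urllib2.urlopen(url).read()

so just add Headers to the request, as in getHtml below.

4. The site zhihudaily.ahorn.me goes down from time to time, so errors sometimes occur.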

 def getHtml(url):
     # send browser-like headers; a bare urlopen(url) gets blocked
     header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
     request = urllib2.Request(url, None, header)
     response = urllib2.urlopen(request)
     text = response.read()
     return text
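Since the site is sometimes down, one way to soften that is to retry a few times before giving up. The following is a minimal sketch, not part of the original script: `fetch_with_retry`, its parameters, and the choice of three attempts are all made up for illustration. It catches `IOError`, which `urllib2.URLError` is a subclass of.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url), retrying on IOError up to `attempts` times in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except IOError as err:  # urllib2.URLError is a subclass of IOError
            last_error = err
            # Wait before the next try, but not after the final failure.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

In the script this would wrap the existing function, for example `fetch_with_retry(getHtml, page)`.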

5. For content parsing you can use re directly, or call the functions in BeautifulSoup (regular expressions make me nervous, so I went straight to bs), for example

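 def saveText(text):
     soup = BeautifulSoup(text)
     # the text of the first <h2> becomes the file name
     filename = soup.h2.get_text() + ".txt"
     fp = open(filename, 'w')
     # first <div class="content"> in the page
     content = soup.find('div', "content")
     content = content.get_text()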
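For comparison, the re route mentioned above might look something like this. It is only a sketch: the sample markup is hypothetical (the real page structure is not shown here), and `extract_titles` is a made-up helper name.

```python
import re

# Hypothetical markup; the real page structure is not shown in the post.
SAMPLE_HTML = (
    '<h2 class="title">First question</h2>'
    '<div class="content"><p>First answer</p></div>'
    '<h2 class="title">Second question</h2>'
    '<div class="content"><p>Second answer</p></div>'
)


def extract_titles(html):
    """Return the text of every <h2> element, in document order."""
    # re.S lets "." match newlines; the non-greedy ".*?" stops at the first </h2>.
    raw = re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.S)
    # Strip any nested tags and surrounding whitespace.
    return [re.sub(r"<[^>]+>", "", t).strip() for t in raw]


titles = extract_titles(SAMPLE_HTML)
```

Note that `re.findall` collects every match, whereas `soup.h2` and `soup.find` return only the first one. BeautifulSoup's counterpart for collecting all matches is `find_all`. Whether that difference is what causes the bug in note 2 has not been checked.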

show me the code

 #Filename:getZhihu.py
 import re
 import sys
 import urllib2

 from bs4 import BeautifulSoup

 # Python 2 only: make utf-8 the default encoding so Chinese text can be written
 reload(sys)
 sys.setdefaultencoding("utf-8")

 # get the html code
 def getHtml(url):
     header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': '******'}
     request = urllib2.Request(url, None, header)
     response = urllib2.urlopen(request)
     text = response.read()
     return text

 # save the content in txt files
 def saveText(text):
     soup = BeautifulSoup(text)
     filename = soup.h2.get_text() + ".txt"
     content = soup.find('div', "content").get_text()
     # print content  # test
     with open(filename, 'w') as fp:
         fp.write(content)

 # get the urls from zhihudaily.ahorn.me
 def getUrl(url):
     html = getHtml(url)
     soup = BeautifulSoup(html)
     urls_page = soup.find('div', "post-body")
     # a single capture group makes findall return plain URL strings
     urls = re.findall('"(http://.*?)"', str(urls_page))
     return urls

 # main() function
 def main():
     page = "http://zhihudaily.ahorn.me"
     urls = getUrl(page)
     for url in urls:
         text = getHtml(url)
         saveText(text)

 if __name__ == "__main__":
     main()
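The script above targets Python 2.7, where urllib2 exists. Under Python 3 the same idea goes through urllib.request. Here is a rough sketch of what getHtml could look like there. The names `build_request` and `get_html`, the timeout, and the UTF-8 decoding are assumptions, and the Referer header is left out because its value is masked in the original.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:14.0) "
              "Gecko/20100101 Firefox/14.0.1")


def build_request(url):
    """Return a Request carrying the same User-Agent header as getHtml."""
    return urllib.request.Request(url, data=None,
                                  headers={"User-Agent": USER_AGENT})


def get_html(url, timeout=10):
    """Fetch url and return the body as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping the request construction in its own function makes it possible to check the headers without touching the network.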
