[Python] My First Web Crawler

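 import re
 import urllib.request

 class DownPic:
     def __init__(self, url, re_str):
         self.url = url        # page to crawl
         self.re_str = re_str  # regex whose group captures an image URL

     def getHtml(self, url):
         page = urllib.request.urlopen(url)
         html = page.read()
         # read() returns bytes, so decode them into text
         # (UTF-8 assumed; undecodable bytes are dropped)
         return html.decode("utf-8", errors="ignore")

     def downloadPic(self):
         imgre = re.compile(self.re_str)  # build the regex
         html = self.getHtml(self.url)    # read the page
         imglist = imgre.findall(html)
         for x, imgurl in enumerate(imglist):
             print(imgurl)
             try:
                 # save the image locally; ../data must already exist
                 urllib.request.urlretrieve(imgurl, "../data/%s.jpg" % x)
             except (OSError, ValueError) as err:
                 print("error", err)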

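The code above defines a class that takes two arguments: the URL of the page to crawl, and a regular expression that matches the image URLs to download.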
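To make the second argument more concrete, here is a minimal sketch of how a pattern with one capture group behaves under `findall`. The HTML fragment and the image URLs in it are invented for illustration; only the pattern is the one passed to `DownPic` in this post.

```python
import re

# Hypothetical HTML fragment, made up for this example (not real page content).
sample_html = (
    '<img class="BDE_Image" src="https://imgsa.baidu.com/forum/pic/item/a1.jpg" pic_ext="jpeg">'
    '<img src="https://example.com/banner.jpg" alt="ad">'
    '<img class="BDE_Image" src="https://imgsa.baidu.com/forum/pic/item/b2.jpg" pic_ext="jpeg">'
)

# The parentheses mark the part of each match that findall returns.
pattern = re.compile(r'src="(https://imgsa.baidu.com.+?\.jpg)" pic_ext')

image_urls = pattern.findall(sample_html)
print(image_urls)
```

Because the pattern contains exactly one group, `findall` returns only the captured URLs rather than the whole matched text, and the `example.com` image is skipped because it does not match the host in the pattern.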

Here is how to call it:

 # assumes the class above is saved as downpic.py
 from downpic import DownPic

 downPic = DownPic("http://tieba.baidu.com/p/2460150866", r'src="(https://imgsa.baidu.com.+?\.jpg)" pic_ext')
 downPic.downloadPic()
 print("over")
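One practical point before running this: `urlretrieve` writes to the path it is given, so the `../data` directory has to exist already, otherwise every download fails and the loop only prints `error`. Below is a minimal sketch of creating the directory first. The helper name `ensure_output_dir` is hypothetical and not part of the original code.

```python
import os


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet.

    Hypothetical helper for this example; returns the absolute path.
    """
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Calling `ensure_output_dir("../data")` once before `downPic.downloadPic()` would be enough, and calling it again later is harmless because of `exist_ok=True`.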

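As the code above shows, a simple crawler has three basic steps:
1. Read the page's HTML.
2. Use a regular expression to extract the target links.
3. Download them.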
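The three steps can also be written as three small functions, which makes each one easy to try on its own. This is only a sketch under assumptions: the function names, the UTF-8 decoding, and the injectable `retrieve` parameter are choices made for this example, and `out_dir` is assumed to exist already.

```python
import os
import re
import urllib.request


def fetch_html(url):
    """Step 1: read the page and decode the bytes into text (UTF-8 assumed)."""
    with urllib.request.urlopen(url) as page:
        return page.read().decode("utf-8", errors="ignore")


def extract_links(html, pattern):
    """Step 2: return every string captured by the pattern's single group."""
    return re.findall(pattern, html)


def download_all(links, out_dir, retrieve=urllib.request.urlretrieve):
    """Step 3: save each link as 0.jpg, 1.jpg, ... and return the saved paths.

    `retrieve` is a parameter only so that this sketch can be exercised
    without network access; out_dir is assumed to exist.
    """
    saved = []
    for index, link in enumerate(links):
        target = os.path.join(out_dir, "%d.jpg" % index)
        try:
            retrieve(link, target)
            saved.append(target)
        except (OSError, ValueError) as err:
            # URLError and HTTPError are subclasses of OSError.
            print("failed:", link, err)
    return saved
```

With these in place, `download_all(extract_links(fetch_html(url), pattern), "../data")` would run the whole pipeline, while `extract_links` can be checked against any saved HTML string without touching the network.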
