[Download error: TOO MANY REQUESTS] & [TypeError: expected string or buffer]

While running the link crawler from section 1.4.4 of "Web Scraping with Python", I ran into the following error:

Download error: TOO MANY REQUESTS
Traceback (most recent call last):
  File "1.py", line 52, in <module>
    link_crawler('http://example.webscraping.com','/index')
  File "1.py", line 34, in link_crawler
    for link in get_links(html):
  File "1.py", line 50, in get_links
    return webpage_regex.findall(html)
TypeError: expected string or buffer

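Analysis: first locate where the exception is raised. The server rejected a request with TOO MANY REQUESTS, so download() returned None, and get_links() then passed that None to webpage_regex.findall(), which raised the TypeError. Adding a wait after each request stops the crawler from sending too many requests to the server at once, which resolves the problem.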
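The chain described above (failed download, then None, then TypeError) can be reproduced without touching the network. The following is a minimal sketch, not part of the book's code: get_links_safe is a hypothetical helper name, and the None guard is an assumption about how one might harden get_links(). The exact wording of the TypeError message varies by Python version.

```python
import re

# Same pattern as the crawler's get_links(): capture the href value of <a> tags.
WEBPAGE_REGEX = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)


def get_links_safe(html):
    """Return every href found in html, or an empty list if the download failed.

    Hypothetical helper: download() returns None after an error such as
    TOO MANY REQUESTS, so None is treated as "no links".
    """
    if html is None:
        return []
    return WEBPAGE_REGEX.findall(html)


# Passing None straight to findall() raises the TypeError from the traceback.
try:
    WEBPAGE_REGEX.findall(None)
except TypeError as e:
    print('TypeError: %s' % e)

# With the guard, a failed download yields no links instead of a crash.
print(get_links_safe(None))
print(get_links_safe('<a href="/index/1">Next</a>'))
```

The guard does not stop the server from rejecting requests. It only keeps one failed page from ending the whole crawl.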

The original code (the version that fails) is shown below:

 # encoding: UTF-8
 import re
 import urlparse
 import urllib2

 def download(url, user_agent='wswp', num_retries=2):
     print 'Downloading:', url
     headers = {'User-agent': user_agent}
     request = urllib2.Request(url, headers=headers)
     try:
         # Open the Request object so the User-agent header is actually sent
         html = urllib2.urlopen(request).read()
     except urllib2.URLError as e:
         print 'Download error:', e.reason    # print the reason for the failure
         html = None
         if num_retries > 0:
             # Retry only when the error carries an HTTP status code in the 5xx range
             if hasattr(e, 'code') and 500 <= e.code < 600:
                 # Pass user_agent through so num_retries lands in the right parameter
                 return download(url, user_agent, num_retries - 1)
     return html

 def link_crawler(seed_url, link_regex):
     crawl_queue = [seed_url]
     # set() keeps only unique items, so each link is queued at most once
     seen = set(crawl_queue)    # links that have already been seen
     while crawl_queue:
         url = crawl_queue.pop()
         html = download(url)
         for link in get_links(html):
             if re.search(link_regex, link):    # does the link match the given regular expression?
                 link = urlparse.urljoin(seed_url, link)    # convert the relative URL to an absolute one
                 if link not in seen:    # skip links that were already seen
                     seen.add(link)
                     crawl_queue.append(link)

 def get_links(html):
     # Match strings of the form <a href="xxx"> and capture the href value
     webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
     return webpage_regex.findall(html)

 link_crawler('http://example.webscraping.com', '/index')

Add a wait at the point of failure (the time.sleep line), as follows:

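import time    # required for time.sleep() below

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    # set() keeps only unique items, so each link is queued at most once
    seen = set(crawl_queue)    # links that have already been seen
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        for link in get_links(html):
            time.sleep(0.01)    # pause briefly so the server is not hit with too many requests at once
            if re.search(link_regex, link):    # does the link match the given regular expression?
                link = urlparse.urljoin(seed_url, link)    # convert the relative URL to an absolute one
                if link not in seen:    # skip links that were already seen
                    seen.add(link)
                    crawl_queue.append(link)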
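The fixed sleep above pauses once per extracted link. An alternative is to pause once per request and track the delay per domain. The following is a rough sketch of that idea, not the code used in this post: the Throttle class, its wait method, and the delay values are assumptions chosen for illustration.

```python
import time

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2, as used in this post


class Throttle(object):
    """Keep requests to the same domain at least `delay` seconds apart."""

    def __init__(self, delay):
        self.delay = delay    # minimum gap between requests to one domain
        self.last_seen = {}   # domain -> time of the most recent request

    def wait(self, url):
        """Sleep if the domain of `url` was requested less than `delay` seconds ago."""
        domain = urlparse(url).netloc
        last = self.last_seen.get(domain)
        if self.delay > 0 and last is not None:
            remaining = self.delay - (time.time() - last)
            if remaining > 0:
                time.sleep(remaining)  # this domain was hit too recently
        self.last_seen[domain] = time.time()
```

In link_crawler, one would create throttle = Throttle(1) before the while loop and call throttle.wait(url) immediately before download(url). A suitable delay depends on the server, so the value 1 is only a placeholder.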

Test:

The pages now download normally.

If the run is still interrupted by an error, wrap the failing code in try...except to catch the exception and debug it.
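One way to apply the try...except advice is to wrap the work done for a single page, so that an exception is printed with its full traceback while the crawl continues. This is only a sketch under assumptions: crawl_step is a hypothetical helper, and download and get_links are passed in as arguments so the snippet stands on its own.

```python
import traceback


def crawl_step(url, download, get_links):
    """Download one page and return its links; on any error, log it and return []."""
    try:
        html = download(url)
        return get_links(html)
    except Exception:
        # Show which URL failed and the full traceback, then let the crawl go on.
        print('Error while processing %s' % url)
        traceback.print_exc()
        return []
```

Inside link_crawler, the loop header would then read for link in crawl_step(url, download, get_links). Catching Exception broadly is convenient while debugging but can hide real bugs, so it is worth narrowing once the cause is known.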
