Garbled Chinese characters when scraping web pages with Python

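When I scrape a web page with Python, sending the request as if it came from a browser, the Chinese characters I get back are garbled. How can I fix this?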

    import codecs
    import re
    import urllib2

    import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    url = "http://newhouse.hfhouse.com/"
    # Send a browser-like User-Agent header with the request
    req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:24.0) Gecko/20100101 Firefox/24.0"})
    reqHtml = urllib2.urlopen(req).read()
    #print reqHtml
    songtasteHtmlEncoding = 'utf-8'
    soup = BeautifulSoup.BeautifulStoneSoup(reqHtml, fromEncoding=songtasteHtmlEncoding)
    #print soup
    re_h = re.compile(r'</?\w+[^>]*>')  # matches any HTML tag
    finda = soup.findAll('a', {"class": "area_list"})
    s = len(finda)
    i = 0
    while i < s:
        # Strip the tags and keep only the link text
        quyuz = re_h.sub('', str(finda[i])).strip()
        try:
            quyu = quyuz.decode('utf-8').encode('gbk')
        except:
            # Retry after dropping a leading UTF-8 BOM
            if quyuz[:3] == codecs.BOM_UTF8:
                quyu = quyuz[3:]
                print quyu.decode("utf-8").encode('gbk')
        #quyu = quyu.decode('utf-8').encode('gbk')
        #number = int(filter(str.isdigit, quyuz))
        #dir2 = make_dir(dir1,quyu)
        value = finda[i]['val']
        houseid = finda[i]['href']
        print houseid, value, quyu
        i += 1  # move on to the next link

It always fails with UnicodeEncodeError: 'gbk' codec can't encode character u'\xe7' in position 0: illegal multibyte sequence. The encoding declared in the page's head is utf-8. What should I do?
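
One possible reading of that error, offered as a guess rather than a diagnosis: u'\xe7' is what the first byte of many UTF-8 encoded Chinese characters turns into when the bytes are decoded as Latin-1, so the text may have been decoded with the wrong codec somewhere before the `encode('gbk')` call. The sketch below is written for Python 3 and uses only the standard library. The helper names (`decode_page`, `repair_mojibake`, `to_console_bytes`) and the sample string are made up for illustration and do not come from the page in question. It shows three steps: decode the raw bytes once with the declared encoding, undo a wrong Latin-1 decode if one happened, and encode for a GBK console without raising.

```python
import codecs

# Sample text used only for illustration: the three characters of "瑶海区".
SAMPLE = "\u7476\u6d77\u533a"


def decode_page(raw, declared="utf-8"):
    """Turn raw response bytes into text, dropping a leading UTF-8 BOM."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode(declared, errors="replace")


def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 bytes.

    If the text does not look like that kind of damage, return it unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_console_bytes(text, console="gbk"):
    """Encode for a GBK console; unmappable characters become '?'."""
    return text.encode(console, errors="replace")


# Bytes standing in for the downloaded page (BOM + UTF-8 text).
raw = codecs.BOM_UTF8 + SAMPLE.encode("utf-8")
text = decode_page(raw)

# Simulate the suspected failure: UTF-8 bytes decoded as Latin-1.
garbled = SAMPLE.encode("utf-8").decode("latin-1")
fixed = repair_mojibake(garbled)
```

In the Python 2 code from the question, the same idea would mean keeping the text as `unicode` all the way through and converting to GBK only at the moment of printing; whether `repair_mojibake` is needed at all depends on where the wrong decode actually happens, which the question does not show.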
