抓取漫画的网址是:sf互动传媒

抓取漫画的由来也是看了知乎上有人说用爬取漫画,然后自己也玩玩

首页中每个漫画的url是类似这样存储的:

<tr>
<td height="30" align="center" bgcolor="#FFFFFF">
<a href="http://comic.sfacg.com/HTML/KOL/" target="_blank">K.O.I 偶像之王</a>
</td>
</tr>

然后用lxml通过cssselect(tr>td>a)将能用到的解析出来,然后解析出来会有很多其他的页面的url和信息,然后我是通过url中包含"/mh/"或者是"/HTML/"进行过滤的

比较蠢的办法了

然后通过对象,将过滤出来的漫画的url和漫画的名字保存在一个这样的类中,然后通过列表进行存储

class Cartoon():
url = None
name = None

然后用随便一个漫画作为例子:勇者赫鲁库

漫画一共很多章,并且所有章的信息都在如下标签中包含

<ul class="serialise_list Blue_link2">....</ul>

然后通过BS将每一章的信息进行存储,然后可以看到其完整的url是:http://comic.sfacg.com/HTML/YZHLK/096/

然后每一章会有很多的page,并且每一章的内容是ajax进行加载的,然后从检查->网络中可以看到有一个这样的请求

然后请求的response里面包含本章的所有图片,然后只需将在每一章的页面中将.js接口找到,然后将本章的图片信息存储,然后就可以得到本章的图片信息,然后存储到本地

# -*- coding: utf-8 -*-
import re
import urllib
import urllib2
import os
import stat
import itertools
import re
import sys
import requests
import json
import time
import socket
import urlparse
import csv
import random
from datetime import datetime, timedelta
import lxml.html from zipfile import ZipFile
from StringIO import StringIO
from downloader import Downloader
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser
from itertools import product
import sys
reload(sys)
sys.setdefaultencoding('utf8')
URL = 'http://comic.sfacg.com'
picture = 'http://coldpic.sfacg.com' class Cartoon():
url = None
name = None def download(url, user_agent='wswp', num_try=2): headers = {'User_agent': user_agent}
request = urllib2.Request(url, headers=headers)
try:
html = urllib2.urlopen(request).read()
except urllib2.URLError as e:
print 'Download error', e.reason
html = None
if num_try > 0:
if hasattr(e, 'code') and 500 <= e.code < 600:
return download(url, user_agent, num_try - 1)
elif e.code == 403:
return None
return html def get_section_url(url):
html = download(url)
if html == None:
return None
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(name='ul', attrs={'class': 'serialise_list Blue_link2'})
res = r'<a.*?href="([^"]*)".*?>([\S\s]*?)</a>'
links = re.findall(res, str(results),re.S | re.M)
return links def get_section_page(url): html = download(url)
if html == None:
return None
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(name='script', attrs={'type': 'text/javascript'})
tt = len(results)
js = results[tt-1]
mm = js.get('src')
if mm == None:
result = soup.find_all(name='script', attrs={'language': 'javascript'})
js1 = result[1]
mm = js1.get('src')
html1 = download(URL+mm)
list = html1.split(';')
List = []
for each in list:
if 'picAy[' in each:
src = each.split('=')
List.append(picture+src[1][2:-1]) return List def download_cartoon(url, cartoon_name,Section,num): path = "自己定义的路径"+cartoon_name if not os.path.exists(path):
os.mkdir(path)
path = path + "/"+Section
if not os.path.exists(path):
os.mkdir(path)
content = requests.get(url).content
with open(path + '/' + str(num) + '.jpg', 'wb') as f:
f.write(content)
print "Downloading cartoon_name " + path + str(num)+ "下载完成"
f.close() if __name__ == '__main__':
cartoon_list = [] html = download(URL)
tree = lxml.html.fromstring(html)
results = tree.cssselect('tr > td > a')
for each in results:
ti = each.get('href')
if '/mh/' in ti or '/HTML/' in ti:
if each.text_content() != "":
cartoon = Cartoon()
cartoon.url = each.get('href')
cartoon.name = each.text_content().replace(' ','')
cartoon_list.append(cartoon) for each in cartoon_list:
print each.url
print each.name
links = get_section_url(each.url)
links = list(reversed(links))
section = 0
for link in links:
ul = URL + link[0]
List = []
List = get_section_page(ul)
section = section + 1
Section = r'第'+ str(section) + r'章'
num = 1
for mm in List:
#print mm
download_cartoon(mm,each.name,Section,num)
num = num + 1
print each.name + Section + "下载完成"+str(num-1)+"张"

最新文章

  1. 图解说明——究竟什么是Windows句柄
  2. CSS 后代选择器
  3. SQL中的共享锁分析及如何解锁
  4. 基于HTML5技术的电力3D监控应用(三)
  5. [转载]两个半小时学会Perl
  6. Android是如何绘制View的
  7. spring 注入失败
  8. Best Time to Buy and Sell Stock III 解答
  9. Understanding responsibilities is key to good object-oriented design(转)
  10. 使用Bootstrap
  11. ubuntu12.04的NFS配置
  12. 4.Apache POI使用详解
  13. Oracle 11g 中SQL性能优化新特性之SQL性能分析器(SQLPA)
  14. mpvue——另类支持v-html
  15. 【学习】数据聚合和分组运算【groupby】
  16. leetcode — edit-distance
  17. 二叉树的遍历--C#程序举例二叉树的遍历
  18. centos7安装Apache
  19. 呼叫转接ECT
  20. IntelliJ IDEA 缺少 javax 包 支持

热门文章

  1. 洛谷 P2664 树上游戏 解题报告
  2. [ACM]Codeforces Round #534 (Div. 2)
  3. 《JavaScript高级程序设计(第三版)》-3
  4. 03-树3. Tree Traversals Again (25)将先序遍历和中序遍历转为后序遍历
  5. 如何将下载的web工程导入到eclipse中使用
  6. Kruskal-Wallis test
  7. 给上传文件的input控件“美容”
  8. Jquery validate验证表单时多个name相同的元素只验证第一个的问题
  9. JAVA多线程提高七:Callable与Future的应用
  10. JAVA中反射机制五(JavaBean的内省与BeanUtils库)