Crawling Xianzhi Forum Articles with Java

0x00 Preface

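The previous article covered part of the crawler code. This one gives a complete program that crawls the articles on the Xianzhi forum.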

0x01 Code Implementation

Add the following dependencies to pom.xml:

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Implementation

Implementation class:

package xianzhi;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class Climbimpl implements Runnable {
    private String url;
    private int pages;
    private String filename;
    // Shared by every thread that runs this Runnable, so the crawl below
    // is executed by one thread at a time.
    Lock lock = new ReentrantLock();

    public Climbimpl(String url, int pages, String filename) {
        this.url = url;
        this.pages = pages;
        this.filename = filename;
    }

    public void run() {
        // Create the output directory if it does not exist yet.
        File file = new File(this.filename);
        boolean mkdir = file.mkdir();
        if (mkdir) {
            System.out.println("Directory created");
        }
        lock.lock();
        try {
            // String url = "https://xz.aliyun.com/";
            // Walk listing pages 1..pages inclusive.
            for (int i = 1; i <= this.pages; i++) {
                try {
                    // Fetch one listing page and collect the article links.
                    String requesturl = this.url + "?page=" + i;
                    Document doc = Jsoup.parse(new URL(requesturl), 10000);
                    Elements element = doc.getElementsByClass("topic-title");
                    List<String> href = element.eachAttr("href");
                    for (String s : href) {
                        try {
                            // Fetch the article page and read its title.
                            Document requests = Jsoup.parse(new URL(this.url + s), 100000);
                            // String topic_content = requests.getElementById("topic_content").text();
                            String title = requests.getElementsByClass("content-title").first().text();
                            System.out.println("Crawled " + title + " -> " + this.filename + title + ".html");
                            // try-with-resources flushes and closes the stream even if the write fails.
                            try (BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(
                                    new FileOutputStream(this.filename + title + ".html"))) {
                                bufferedOutputStream.write(requests.toString().getBytes(StandardCharsets.UTF_8));
                            }
                        } catch (Exception e) {
                            System.out.println("Failed to crawl " + this.url + s + ", error: " + e);
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        } finally {
            // Release the lock even if the crawl throws.
            lock.unlock();
        }
    }
}
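
The implementation class uses the article title directly as the file name. A title containing characters such as / : or ? is not a valid file name on Windows, so saving that article would fail and end up in the catch block. Below is a minimal sketch of one way to clean the title first. The class SafeFileName and the method sanitize are hypothetical names introduced for illustration and are not part of the original code.

```java
public class SafeFileName {
    // Replaces characters that Windows rejects in file names, plus control
    // characters, with an underscore. Falls back to "untitled" when nothing
    // usable is left. (Hypothetical helper, not part of the original crawler.)
    static String sanitize(String title) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }

    public static void main(String[] args) {
        String title = "CVE analysis: what/why?";
        // The cleaned title can be appended to the output directory as before.
        System.out.println(sanitize(title) + ".html");
    }
}
```

In the crawler, the value passed to FileOutputStream could then be built from sanitize(title) instead of the raw title.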

Main class:

package xianzhi;

public class TestClimb {
    public static void main(String[] args) {
        int Threadlist_num = 10; // number of threads
        String url = "https://xz.aliyun.com/"; // base URL
        int pages = 10; // number of pages to read
        String path = "D:\\paramss\\"; // output directory
        Climbimpl climbimpl = new Climbimpl(url, pages, path);
        for (int i = 0; i < Threadlist_num; i++) {
            new Thread(climbimpl).start();
        }
    }
}
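
Note that every thread started here runs the same Climbimpl instance, and the lock inside run() lets only one thread crawl at a time, so each thread walks the same page range in turn. One possible way to let the threads divide the pages among themselves is a shared counter. The sketch below shows only the dispatching idea and does no network access. The class PageDispatcher and its members are hypothetical names, and a real crawler would fetch the page where the sketch records the page number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PageDispatcher implements Runnable {
    private final int pages;
    // Next page number to hand out; shared by all worker threads.
    private final AtomicInteger nextPage = new AtomicInteger(1);
    // Records the claimed pages; a real crawler would fetch them instead.
    final Set<Integer> claimed = ConcurrentHashMap.newKeySet();
    final AtomicInteger claimCount = new AtomicInteger();

    public PageDispatcher(int pages) {
        this.pages = pages;
    }

    @Override
    public void run() {
        int page;
        // getAndIncrement is atomic, so no two threads receive the same page.
        while ((page = nextPage.getAndIncrement()) <= pages) {
            claimed.add(page);
            claimCount.incrementAndGet();
        }
    }

    // Starts the given number of threads on one dispatcher and waits for them.
    static PageDispatcher runWith(int threads, int pages) {
        PageDispatcher dispatcher = new PageDispatcher(pages);
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(dispatcher);
            workers.add(t);
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return dispatcher;
    }

    public static void main(String[] args) {
        PageDispatcher d = runWith(10, 10);
        System.out.println("pages claimed: " + d.claimed.size());
    }
}
```

With this layout no lock is needed around the loop, because the counter alone decides which thread handles which page.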

0x02 Conclusion

Overall, the code for this crawler is fairly simple.
