Crawling Xianzhi Forum Articles with Java
2024-08-29 11:15:29
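0x00 Preface
The previous article covered part of the crawler code; this one gives a complete program that crawls articles from the Xianzhi forum.
0x01 Implementation
Add the dependencies to pom.xml: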
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.7</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>
Implementation code
Implementation class:
package xianzhi;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Requires Java 7 or later (try-with-resources, StandardCharsets).
public class Climbimpl implements Runnable {
    private final String url;
    private final int pages;
    private final String filename;
    // Shared page counter: all threads run this same instance,
    // so each list page is claimed by exactly one thread.
    private final AtomicInteger nextPage = new AtomicInteger(1);

    public Climbimpl(String url, int pages, String filename) {
        this.url = url;
        this.pages = pages;
        this.filename = filename;
    }

    public void run() {
        File dir = new File(this.filename);
        // mkdirs() also creates any missing parent directories
        if (dir.mkdirs()) {
            System.out.println("Directory created");
        }
        // String url = "https://xz.aliyun.com/";
        int page;
        // List pages are numbered from 1 to pages, inclusive
        while ((page = nextPage.getAndIncrement()) <= this.pages) {
            try {
                String requestUrl = this.url + "?page=" + page;
                Document doc = Jsoup.parse(new URL(requestUrl), 10000);
                Elements elements = doc.getElementsByClass("topic-title");
                List<String> hrefs = elements.eachAttr("href");
                for (String href : hrefs) {
                    try {
                        // Resolve the link against the base URL to avoid a doubled "/"
                        URL articleUrl = new URL(new URL(this.url), href);
                        Document article = Jsoup.parse(articleUrl, 100000);
                        // String topic_content = article.getElementById("topic_content").text();
                        String title = article.getElementsByClass("content-title").first().text();
                        // Replace characters that Windows does not allow in file names
                        String safeTitle = title.replaceAll("[\\\\/:*?\"<>|]", "_");
                        File out = new File(dir, safeTitle + ".html");
                        // try-with-resources closes the stream even if the write fails
                        try (BufferedOutputStream stream = new BufferedOutputStream(new FileOutputStream(out))) {
                            stream.write(article.toString().getBytes(StandardCharsets.UTF_8));
                        }
                        System.out.println("Crawled " + title + " -> " + out.getPath());
                    } catch (Exception e) {
                        System.out.println("Failed to crawl " + this.url + href + ", error: " + e);
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
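Two details in the crawl loop are easy to get wrong: joining the base URL with a relative `href`, and turning an article title into a file name. The sketch below isolates both using only the JDK. The class name `CrawlPaths`, the helpers `resolve` and `safeFileName`, and the sample path `/t/1234` are made up for illustration and are not part of the crawler above.

```java
import java.net.URI;

public class CrawlPaths {
    // Hypothetical helper: resolves a link found on a page against the base URL,
    // so "https://xz.aliyun.com/" + "/t/1234" does not end up with a doubled slash.
    public static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    // Hypothetical helper: replaces the characters Windows forbids in file names
    // (\ / : * ? " < > |) with underscores and trims surrounding whitespace.
    public static String safeFileName(String title) {
        return title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }

    public static void main(String[] args) {
        // "/t/1234" is a placeholder path, not a real article
        System.out.println(resolve("https://xz.aliyun.com/", "/t/1234")); // prints https://xz.aliyun.com/t/1234
        System.out.println(safeFileName("a/b: c?"));                      // prints a_b_ c_
    }
}
```

Resolving through `URI` rather than string concatenation also handles absolute links correctly: if an `href` is already a full URL, `resolve` returns it unchanged.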
Main class:
package xianzhi;

public class TestClimb {
    public static void main(String[] args) {
        int threadCount = 10; // number of threads
        String url = "https://xz.aliyun.com/"; // base URL
        int pages = 10; // number of list pages to read
        String path = "D:\\paramss\\"; // output directory
        // All threads share one Climbimpl instance
        Climbimpl climbimpl = new Climbimpl(url, pages, path);
        for (int i = 0; i < threadCount; i++) {
            new Thread(climbimpl).start();
        }
    }
}
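When several threads run one shared `Runnable`, the work has to be divided among them; otherwise every thread walks the same pages. A minimal sketch of one way to do that, assuming a shared `AtomicInteger` page counter and a fixed thread pool: the names `PageDispatcher` and `dispatch` are hypothetical, and the actual page fetch is replaced by a visit counter so the example runs without network access.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class PageDispatcher {
    // Runs `threads` workers that together claim pages 1..pages.
    // Returns how often each page was visited (index 0 is unused).
    public static int[] dispatch(int threads, final int pages) {
        final AtomicInteger nextPage = new AtomicInteger(1);
        final AtomicIntegerArray visits = new AtomicIntegerArray(pages + 1);
        Runnable worker = new Runnable() {
            @Override
            public void run() {
                int page;
                // getAndIncrement() is atomic, so no two threads get the same page
                while ((page = nextPage.getAndIncrement()) <= pages) {
                    visits.incrementAndGet(page); // a real worker would fetch the page here
                }
            }
        };
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(worker);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int[] result = new int[pages + 1];
        for (int p = 0; p <= pages; p++) {
            result[p] = visits.get(p);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] visits = dispatch(10, 10);
        for (int page = 1; page < visits.length; page++) {
            System.out.println("page " + page + " visited " + visits[page] + " time(s)");
        }
    }
}
```

With this scheme each page is visited exactly once regardless of the thread count, and `awaitTermination` lets the caller know when the whole crawl has finished.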
0x02 Conclusion
Overall, the crawler's code is fairly simple.