java从零到变身爬虫大神

刚开始先从最简单的爬虫逻辑入手

爬虫最简单的解析面真的是这样

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import java.io.IOException;

public class Test {

    public static void Get_Url(String url) {

        try {

         Document doc = Jsoup.connect(url)

          //.data("query", "Java")

          //.userAgent("头部")

          //.cookie("auth", "token")

          //.timeout(3000)

          //.post()

          .get();

        } catch (IOException e) {

              e.printStackTrace();

        }

    }

}

这只是一个函数而已

那么在下面加上：

  //main函数

     public static void main(String[] args) {

         String url = "...";

        Get_Url(url);

     }

哈哈，搞定

就是这么一个爬虫了

太神奇

但是得到的只是网页的html页面的东西

而且还没筛选

那么就筛选吧

public static void Get_Url(String url) {

        try {

         Document doc = Jsoup.connect(url)

          //.data("query", "Java")

          //.userAgent("头部")

          //.cookie("auth", "token")

          //.timeout(3000)

          //.post()

          .get();

        //得到html的所有东西

        Element content = doc.getElementById("content");

        //分离出html下<a>...</a>之间的所有东西

        Elements links = content.getElementsByTag("a");

        //Elements links = doc.select("a[href]");

        // 扩展名为.png的图片

        Elements pngs = doc.select("img[src$=.png]");

        // class等于masthead的div标签

        Element masthead = doc.select("div.masthead").first();

        for (Element link : links) {

              //得到<a>...</a>里面的网址

              String linkHref = link.attr("href");

              //得到<a>...</a>里面的汉字

              String linkText = link.text();

              System.out.println(linkText);

            }

        } catch (IOException e) {

              e.printStackTrace();

        }

    }

那就用上面的来解析一下我的博客园

解析的是<a>...</a>之间的东西

看起来还不错吧

-------------------------------我是一根牛逼的分割线-------------------------------

其实还有另外一种爬虫的方法更加好

他能批量爬取网页保存到本地

先保存在本地再去正则什么的筛选自己想要的东西

这样效率比上面的那个高了很多

看代码！

//将抓取的网页变成html文件，保存在本地

    public static void Save_Html(String url) {

        try {

            File dest = new File("src/temp_html/" + "保存的html的名字.html");

            //接收字节输入流

            InputStream is;

            //字节输出流

            FileOutputStream fos = new FileOutputStream(dest);

            URL temp = new URL(url);

            is = temp.openStream();

            //为字节输入流加缓冲

            BufferedInputStream bis = new BufferedInputStream(is);

            //为字节输出流加缓冲

            BufferedOutputStream bos = new BufferedOutputStream(fos);

            int length;

            byte[] bytes = new byte[1024*20];

            while((length = bis.read(bytes, 0, bytes.length)) != -1){

                fos.write(bytes, 0, length);

            }

            bos.close();

            fos.close();

            bis.close();

            is.close();

        } catch (IOException e) {

            e.printStackTrace();

        }

    }

这个方法直接将html保存在了文件夹src/temp_html/里面

在批量抓取网页的时候

都是先抓下来，保存为html或者json

然后在正则什么的进数据库

东西在本地了，自己想怎么搞就怎么搞

反爬虫关我什么事

上面两个方法都会造成一个问题

这个错误代表

这种爬虫方法太low逼

大部分网页都禁止了

所以，要加个头

就是UA

方法一那里的头部那里直接

userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

方法二间接加：

 URL temp = new URL(url);

 URLConnection uc = temp.openConnection();

uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

  is = temp.openStream();

加了头部，几乎可以应付大部分网址了

-------------------------------我是一根牛逼的分割线-------------------------------

将html下载到本地后需要解析啊

解析啊看这里啊

//解析本地的html

    public static void Get_Localhtml(String path) {

        //读取本地html的路径

        File file = new File(path);

        //生成一个数组用来存储这些路径下的文件名

        File[] array = file.listFiles();

        //写个循环读取这些文件的名字

        for(int i=0;i<array.length;i++){

            try{

                if(array[i].isFile()){

                //文件名字

                System.out.println("正在解析网址：" + array[i].getName());

                //下面开始解析本地的html

                Document doc = Jsoup.parse(array[i], "UTF-8");

                //得到html的所有东西

                Element content = doc.getElementById("content");

                //分离出html下<a>...</a>之间的所有东西

                Elements links = content.getElementsByTag("a");

                //Elements links = doc.select("a[href]");

                // 扩展名为.png的图片

                Elements pngs = doc.select("img[src$=.png]");

                // class等于masthead的div标签

                Element masthead = doc.select("div.masthead").first();

                for (Element link : links) {

                      //得到<a>...</a>里面的网址

                      String linkHref = link.attr("href");

                      //得到<a>...</a>里面的汉字

                      String linkText = link.text();

                      System.out.println(linkText);

                        }

                    }

                }catch (Exception e) {

                    System.out.println("网址：" + array[i].getName() + "解析出错");

                    e.printStackTrace();

                    continue;

                }

        }

    }

文字配的很漂亮

就这样解析出来啦

主函数加上

//main函数

    public static void main(String[] args) {

        String url = "http://www.cnblogs.com/TTyb/";

        String path = "src/temp_html/";

        Get_Localhtml(path);

    }

那么这个文件夹里面的所有的html都要被我解析掉

好啦

3天java1天爬虫的结果就是这样子咯

-------------------------------我是快乐的分割线-------------------------------

其实对于这两种爬取html的方法来说，最好结合在一起

作者测试过

方法二稳定性不足

方法一速度不好

所以自己改正

将方法一放到方法二的catch里面去

当方法二出现错误的时候就会用到方法一

但是当方法一也错误的时候就跳过吧

结合如下：

import org.jsoup.Jsoup;

import org.jsoup.nodes.Document;

import org.jsoup.nodes.Element;

import org.jsoup.select.Elements;

import java.io.BufferedInputStream;

import java.io.BufferedOutputStream;

import java.io.BufferedReader;

import java.io.File;

import java.io.FileOutputStream;

import java.io.IOException;

import java.io.InputStream;

import java.io.InputStreamReader;

import java.io.OutputStream;

import java.io.OutputStreamWriter;

import java.net.HttpURLConnection;

import java.net.URL;

import java.net.URLConnection;

import java.util.Date;

import java.text.SimpleDateFormat;

public class JavaSpider {

    //将抓取的网页变成html文件，保存在本地

    public static void Save_Html(String url) {

        try {

            File dest = new File("src/temp_html/" + "我是名字.html");

            //接收字节输入流

            InputStream is;

            //字节输出流

            FileOutputStream fos = new FileOutputStream(dest);

            URL temp = new URL(url);

            URLConnection uc = temp.openConnection();

            uc.addRequestProperty("User-Agent", "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5");

            is = temp.openStream();

            //为字节输入流加缓冲

            BufferedInputStream bis = new BufferedInputStream(is);

            //为字节输出流加缓冲

            BufferedOutputStream bos = new BufferedOutputStream(fos);

            int length;

            byte[] bytes = new byte[1024*20];

            while((length = bis.read(bytes, 0, bytes.length)) != -1){

                fos.write(bytes, 0, length);

            }

            bos.close();

            fos.close();

            bis.close();

            is.close();

        } catch (IOException e) {

            e.printStackTrace();

            System.out.println("openStream流错误，跳转get流");

            //如果上面的那种方法解析错误

            //那么就用下面这一种方法解析

            try{

                Document doc = Jsoup.connect(url)

                .userAgent("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; MALC)")

                .timeout(3000)

                .get();

                File dest = new File("src/temp_html/" + "我是名字.html");

                if(!dest.exists())

                    dest.createNewFile();

                FileOutputStream out=new FileOutputStream(dest,false);

                out.write(doc.toString().getBytes("utf-8"));

                out.close();

            }catch (IOException E) {

                E.printStackTrace();

                System.out.println("get流错误，请检查网址是否正确");

            }

        }

    }

    //解析本地的html

    public static void Get_Localhtml(String path) {

        //读取本地html的路径

        File file = new File(path);

        //生成一个数组用来存储这些路径下的文件名

        File[] array = file.listFiles();

        //写个循环读取这些文件的名字

        for(int i=0;i<array.length;i++){

            try{

                if(array[i].isFile()){

                    //文件名字

                    System.out.println("正在解析网址：" + array[i].getName());

                    //文件地址加文件名字

                    //System.out.println("#####" + array[i]);

                    //一样的文件地址加文件名字

                    //System.out.println("*****" + array[i].getPath()); 

                    //下面开始解析本地的html

                    Document doc = Jsoup.parse(array[i], "UTF-8");

                    //得到html的所有东西

                    Element content = doc.getElementById("content");

                    //分离出html下<a>...</a>之间的所有东西

                    Elements links = content.getElementsByTag("a");

                    //Elements links = doc.select("a[href]");

                    // 扩展名为.png的图片

                    Elements pngs = doc.select("img[src$=.png]");

                    // class等于masthead的div标签

                    Element masthead = doc.select("div.masthead").first();

                    for (Element link : links) {

                          //得到<a>...</a>里面的网址

                          String linkHref = link.attr("href");

                          //得到<a>...</a>里面的汉字

                          String linkText = link.text();

                          System.out.println(linkText);

                        }

                    }

                }catch (Exception e) {

                    System.out.println("网址：" + array[i].getName() + "解析出错");

                    e.printStackTrace();

                    continue;

                }

            }

        }

    //main函数

    public static void main(String[] args) {

        String url = "http://www.cnblogs.com/TTyb/";

        String path = "src/temp_html/";

        //保存到本地的网页地址

        Save_Html(url);

        //解析本地的网页地址

        Get_Localhtml(path);

    }

}

总的来说

java爬虫的方法比python的多好多

java的库真特么变态

巴特西

java从零到变身爬虫大神

最新文章

热门文章