Writing a Web Crawler in Java Is Convenient Too

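First we wrap HTTP requests in a utility class. It is implemented with HttpURLConnection here, but you could also use HttpClient, or let Jsoup make the request directly.

The utility class is simple: a single get method that reads the response body of the requested URL, which we use here to fetch a page's HTML. No proxy is used. In real crawling, when you send a large number of requests to one site, the site will apply a range of strategies to block you. That is where proxies come in handy: routing requests through proxies lets you crawl from different IPs.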

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HttpUtils {

    // Fetches the response body of the given URL as a string; returns null on failure.
    public static String get(String url) {
        try {
            URL getUrl = new URL(url);
            HttpURLConnection connection = (HttpURLConnection) getUrl.openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("Accept", "*/*");
            connection.setRequestProperty(
                    "User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; CIBA)");
            connection.setRequestProperty("Accept-Language", "zh-cn");
            connection.connect();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                StringBuilder result = new StringBuilder();
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line terminator, so add it back
                    result.append(line).append('\n');
                }
                return result.toString();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
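
The proxy idea mentioned above can be sketched with the standard library alone. This is a minimal sketch, not part of the original utility class: ProxyHttpUtils, httpProxy, and the timeout values are hypothetical names and choices made for illustration, and the proxy host and port are placeholders you would replace with a proxy you actually control.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original HttpUtils.
final class ProxyHttpUtils {

    // Describes an HTTP proxy at the given host and port (both supplied by the caller).
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Same idea as HttpUtils.get, but the connection is opened through the given proxy.
    static String get(String url, Proxy proxy) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection(proxy);
        connection.setRequestMethod("GET");
        // Assumed timeouts so a dead proxy does not hang the crawler forever
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(5000);
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line).append('\n');
            }
        } finally {
            connection.disconnect();
        }
        return result.toString();
    }
}
```

A call would look like ProxyHttpUtils.get("http://coder520.com/", ProxyHttpUtils.httpProxy("127.0.0.1", 8080)), where the address is only a placeholder. Rotating through a list of such Proxy objects is one way to spread requests across different IPs.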

Next, let's pick any page that contains images and try out the fetching.

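import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the methods just need to live in the same class.
public class ImageCrawler {

    // Matches one whole <img ...> tag; [^>]* keeps the match inside a single tag
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Captures an absolute http(s) src value, quoted or unquoted
    private static final Pattern IMG_SRC =
            Pattern.compile("\\ssrc\\s*=\\s*[\"']?(https?://[^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    public static List<String> getImageSrc(String html) {
        List<String> listImgUrl = new ArrayList<>();
        if (html == null) {
            // HttpUtils.get returns null when the request fails
            return listImgUrl;
        }
        Matcher matcher = IMG_TAG.matcher(html);
        while (matcher.find()) {
            // Look for the src attribute inside this one tag only
            Matcher m = IMG_SRC.matcher(matcher.group());
            if (m.find()) {
                listImgUrl.add(m.group(1));
            }
        }
        return listImgUrl;
    }

    public static void main(String[] args) {
        String url = "http://coder520.com/";
        String html = HttpUtils.get(url);
        List<String> imgUrls = getImageSrc(html);
        for (String imgSrc : imgUrls) {
            System.out.println(imgSrc);
        }
    }
}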

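First the page content is fetched, then regular expressions pick out the img tags in the page, and then the image URL is parsed out of each tag.

Running the program prints the following: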

http://ophdr3ukd.bkt.clouddn.com/logo.png

http://ophdr3ukd.bkt.clouddn.com/SSM.jpg

http://ophdr3ukd.bkt.clouddn.com/%E5%8D%95%E8%BD%A6.jpg

With these URLs we can download the images to the local disk. Here is a download method:

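// Needs imports: java.io.IOException, java.io.InputStream, java.net.URL,
// java.nio.file.Files, java.nio.file.Path, java.nio.file.Paths,
// java.nio.file.StandardCopyOption, java.util.List
public static void main(String[] args) throws IOException {
    String url = "http://coder520.com/";
    String html = HttpUtils.get(url);
    if (html == null) {
        // The request failed; nothing to download
        return;
    }
    List<String> imgUrls = getImageSrc(html);

    // Create the target directory if it does not exist yet
    Path dir = Paths.get("img");
    Files.createDirectories(dir);

    for (String imgSrc : imgUrls) {
        System.out.println(imgSrc);
        String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
        // Close the stream after copying; overwrite files left over from an earlier run
        try (InputStream in = new URL(imgSrc).openStream()) {
            Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}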

Run the program and the images are downloaded.
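
The download code takes everything after the last slash as the file name, so a URL such as the third one in the output above would be saved under its percent-encoded name, and a URL with a query string would carry the query into the file name. The sketch below shows one way to derive a cleaner name. ImageFileNames and fileNameFromUrl are hypothetical names introduced for illustration, and the "unnamed" fallback is an arbitrary choice.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of the original article's code.
final class ImageFileNames {

    // Derives a local file name from an image URL.
    static String fileNameFromUrl(String imgSrc) {
        String path = imgSrc;
        // Drop the fragment and the query string, if any
        int cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        try {
            // URLDecoder treats '+' as a space, so protect literal plus signs first
            name = URLDecoder.decode(name.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            // Malformed escape sequence: keep the raw name
        }
        // A decoded %2F or %5C must not turn into a path separator
        name = name.replace('/', '_').replace('\\', '_');
        return name.isEmpty() ? "unnamed" : name;
    }
}
```

With this helper, the download loop could call Files.copy with Paths.get("img", ImageFileNames.fileNameFromUrl(imgSrc)) instead of concatenating the raw substring.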

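That gives us a simple feature that fetches a page and extracts its images. It still looks like a fair bit of hassle, though, with regular expressions to write and so on. Here is a simpler way: if you know jQuery, extracting elements becomes very easy. The library is jsoup.

jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations.

Add the jsoup dependency: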

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>

With jsoup, the extraction code takes only a few lines:

// Needs imports: java.io.IOException, java.io.InputStream, java.net.URL,
// java.nio.file.Files, java.nio.file.Path, java.nio.file.Paths,
// java.nio.file.StandardCopyOption, org.jsoup.Jsoup, org.jsoup.nodes.Document,
// org.jsoup.nodes.Element, org.jsoup.select.Elements
public static void main(String[] args) throws IOException {
    String url = "http://coder520.com/";
    String html = HttpUtils.get(url);
    if (html == null) {
        // The request failed; nothing to parse
        return;
    }

    Path dir = Paths.get("img");
    Files.createDirectories(dir);

    Document doc = Jsoup.parse(html);
    // Extract the img tags
    Elements imgs = doc.getElementsByTag("img");
    for (Element img : imgs) {
        // Extract the src attribute of the img tag
        String imgSrc = img.attr("src");
        if (imgSrc.startsWith("//")) {
            // Protocol-relative URL: prepend a scheme
            imgSrc = "http:" + imgSrc;
        }
        if (!imgSrc.startsWith("http")) {
            // Skip empty, relative, or data: sources that new URL(...) cannot fetch as-is
            continue;
        }
        System.out.println(imgSrc);
        String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
        // Close the stream after copying; overwrite files left over from an earlier run
        try (InputStream in = new URL(imgSrc).openStream()) {
            Files.copy(in, dir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}

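Jsoup.parse creates a document object, getElementsByTag extracts all the image tags, and the loop reads each image's src attribute with attr and then downloads the image.

Next, let's upgrade this into a small tool with a simple UI: enter a page URL, click the extract button, and the images are downloaded automatically. We can write the UI with Swing.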
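
The src attribute is not always a full URL. Besides the protocol-relative "//" form handled above, pages often use paths relative to the page itself. One way to normalise all of these, sketched here with only the standard library, is to resolve each src against the page URL. ImageUrls and its resolve method are hypothetical names for illustration. jsoup can also do this itself if the page URL is passed as the base URI to Jsoup.parse(html, baseUri) and the attribute is read with absUrl("src").

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the original article's code.
final class ImageUrls {

    // Resolves an img src against the URL of the page it came from.
    // Returns null when the src is empty or cannot be parsed as a URI.
    static String resolve(String pageUrl, String src) {
        if (src == null || src.trim().isEmpty()) {
            return null;
        }
        try {
            // Handles absolute, protocol-relative, root-relative, and relative forms
            return URI.create(pageUrl).resolve(src.trim()).toString();
        } catch (IllegalArgumentException e) {
            // For example a src containing spaces or other illegal characters
            return null;
        }
    }
}
```

The page URL should include a path (at least a trailing slash, as in "http://coder520.com/"), since resolving a relative path against a URL with an empty path is a known rough edge of java.net.URI.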

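import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
import javax.swing.JTextField;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {

    public static void main(String[] args) {
        JFrame frame = new JFrame();
        frame.setResizable(false);
        frame.setSize(425, 400);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setLayout(null);
        frame.setLocationRelativeTo(null);

        // Text field for the page URL
        JTextField jTextField = new JTextField();
        jTextField.setBounds(100, 44, 200, 30);
        frame.add(jTextField);

        // Button that starts the extraction
        JButton jButton = new JButton("Extract");
        jButton.setBounds(140, 144, 100, 30);
        frame.add(jButton);
        frame.setVisible(true);

        // Note: the download runs on the event thread, so the window is
        // unresponsive until it finishes. Acceptable for a small demo.
        jButton.addActionListener(new ActionListener() {
            @Override
            public void actionPerformed(ActionEvent e) {
                String url = jTextField.getText();
                if (url == null || url.trim().isEmpty()) {
                    JOptionPane.showMessageDialog(null, "Please enter a URL to crawl");
                    return;
                }
                String html = HttpUtils.get(url.trim());
                if (html == null) {
                    // HttpUtils.get returns null when the request fails
                    JOptionPane.showMessageDialog(null, "Failed to fetch the page");
                    return;
                }
                File dir = new File("img");
                if (!dir.exists()) {
                    dir.mkdirs();
                }
                Document doc = Jsoup.parse(html);
                Elements imgs = doc.getElementsByTag("img");
                for (Element img : imgs) {
                    String imgSrc = img.attr("src");
                    if (imgSrc.startsWith("//")) {
                        imgSrc = "http:" + imgSrc;
                    }
                    if (!imgSrc.startsWith("http")) {
                        // Skip empty, relative, or data: sources
                        continue;
                    }
                    System.out.println(imgSrc);
                    String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
                    // One failed image should not stop the rest
                    try (InputStream in = new URL(imgSrc).openStream()) {
                        Files.copy(in, Paths.get("img", fileName), StandardCopyOption.REPLACE_EXISTING);
                    } catch (IOException e1) {
                        e1.printStackTrace();
                    }
                }
                JOptionPane.showMessageDialog(null, "Done");
            }
        });
    }
}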

Enter a URL and click the extract button to download the images.
