Question

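Downloading all images from a website using wget is very easy.

But I need this functionality on the client side, preferably in Java.

I know the wget source code is available online, but I don't know any C and the source is quite complex. Of course, wget also has other features that bloat the source for my purposes.

Java has a built-in HttpClient, but I don't know how sophisticated wget really is. Could you tell me whether it is hard to re-implement the "recursively download all images" feature in Java?

How is this done at all? Does wget fetch the HTML source of the given URL, extract all URLs with the given file endings (.jpg, .png) from the HTML, and download them? Does it also search for images in the stylesheets linked from that HTML document?

How would you do this? Would you use regular expressions to search the HTML document for (relative and absolute) image URLs and let HttpClient download each of them? Or is there already a Java library that does something similar?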

Answer 1

In Java, you can use the Jsoup library to parse any web page and extract whatever you want from it.
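As a rough illustration of the Jsoup approach, the sketch below lists the absolute URLs of all `<img>` tags on a page. It assumes Jsoup is on the classpath; the class name `JsoupImageLister` and the helper `imageUrls` are made up for this example and are not part of Jsoup.

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; only the org.jsoup calls belong to the library.
public class JsoupImageLister {

    /** Collects the absolute URLs of all <img src="..."> elements, in document order. */
    static Set<String> imageUrls(Document doc) {
        Set<String> urls = new LinkedHashSet<>();
        for (Element img : doc.select("img[src]")) {
            // absUrl resolves relative paths against the document's base URI
            // and returns an empty string if no absolute URL can be built.
            String abs = img.absUrl("src");
            if (!abs.isEmpty()) {
                urls.add(abs);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: JsoupImageLister <page-url>");
            return;
        }
        // Fetch and parse the page, then print every image URL found.
        Document doc = Jsoup.connect(args[0]).get();
        imageUrls(doc).forEach(System.out::println);
    }
}
```

Each printed URL could then be passed to whatever HTTP client you prefer for the actual download. Images referenced only from stylesheets are not covered by this selector.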

Answer 2

For me, crawler4j was the open-source library for recursively crawling (and replicating) a site, for example like this (their QuickStart example; it also supports CSS URL crawling):

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs ending in one of these extensions are skipped by shouldVisit.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore urls that
     * have css, js, gif, ... extensions and to only accept urls that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

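Note that the QuickStart filter above deliberately skips image URLs (gif, jpg, png), so it would have to be adapted before it can be used to collect images.

More web crawlers and HTML parsers can be found here.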

Answer 3

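I found this program, which downloads images. It is open source.

You can find the images on a website by looking at its <IMG> tags. See the following question, which may help you: Get all Images from WebPage Program | Java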
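To make the <IMG>-tag idea concrete, here is a minimal sketch that uses only the JDK (Java 11 or later, for `java.net.http.HttpClient`). The class `ImageGrabber` and its methods are hypothetical names chosen for this example. The regular expression is only an approximation of HTML parsing, and the sketch handles a single page, so it neither recurses into links nor looks inside stylesheets.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not taken from any library.
public class ImageGrabber {

    // Captures the quoted value of the src attribute of an <img> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns absolute http(s) image URLs in document order, without duplicates. */
    static List<URI> extractImageUrls(String html, URI base) {
        Set<URI> found = new LinkedHashSet<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            try {
                // Relative paths are resolved against the page URL.
                URI uri = base.resolve(m.group(1).trim());
                String scheme = uri.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    found.add(uri);
                }
            } catch (IllegalArgumentException e) {
                // Skip src values that are not valid URIs (for example, unescaped spaces).
            }
        }
        return new ArrayList<>(found);
    }

    /** Fetches pageUrl and saves every image it references into targetDir. */
    static void downloadImages(URI pageUrl, Path targetDir)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        String html = client.send(HttpRequest.newBuilder(pageUrl).build(),
                HttpResponse.BodyHandlers.ofString()).body();
        Files.createDirectories(targetDir);
        int index = 0;
        for (URI image : extractImageUrls(html, pageUrl)) {
            String path = image.getPath();
            String name = path == null ? "" : path.substring(path.lastIndexOf('/') + 1);
            if (name.isEmpty()) {
                name = "image";
            }
            // The running index keeps files with the same name from overwriting each other.
            Path target = targetDir.resolve(index++ + "-" + name);
            client.send(HttpRequest.newBuilder(image).build(),
                    HttpResponse.BodyHandlers.ofFile(target));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: ImageGrabber <page-url> [target-dir]");
            return;
        }
        downloadImages(URI.create(args[0]), Path.of(args.length > 1 ? args[1] : "images"));
    }
}
```

For anything beyond simple pages, a real HTML parser is likely to be more robust than the regular expression used here, since attribute order, unquoted values, and lazily loaded images all trip up this kind of pattern.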

Download all images like wget on the client side
