Question

I need to download PDFs from a website using Crawler4j. I am following this documentation and have created two classes:

PDFCrawler
PDFCrawlController

Now, in my PDFCrawler class, I have a shouldVisit(Page page, WebURL url) method that looks like this:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase(); 
    return href.startsWith(crawlDomain) && pdfPatterns.matcher(href).matches();
}

Here, crawlDomain is the domain passed in from the PDFCrawlController class (for example, http://www.example.com). pdfPatterns is defined as follows:

private static final Pattern pdfPatterns = Pattern.compile(".*(\\.(pdf?))$");

The visit(Page page) method in the PDFCrawler class looks like this:

    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        if (!pdfPatterns.matcher(url).matches()) {
            System.out.println("I am in " + url);
            System.out.println("No match. Leaving.");
            return;
        }
//and so on...

}

Now, when I point PDFCrawler at http://www.example.com, the System.out.println() calls in the visit(Page page) method print the following:

I am in http://www.example.com/allforgood
No match. Leaving.
I am in http://www.another-web-site.iastate.edu/grants/xp2011-02
No match. Leaving.
I am in http://www.example.com/careers
No match. Leaving.
I am in http://www.example.com/wp-content/uploads/2014/01/image-happenings1.png
No match. Leaving.

My questions are:

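Why does the crawler go to another-web-site? Didn't I restrict it from doing so in the shouldVisit() method?
Why does it visit pages from the same domain that are actually images (e.g. png)? Didn't I restrict that in the shouldVisit() method as well?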

Answer 1

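Your shouldVisit function is not being called. Its declaration is not the correct one for the newer version. You are following the example, but the example is wrong.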

The only parameter is the URL. You can see it in the API here.

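Also, if you use the @Override annotation, you can catch things like this. Java will tell you that you are not actually overriding the method you meant to override.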
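To illustrate the overload-versus-override point, here is a minimal, self-contained sketch. It does not use Crawler4j itself: BaseCrawler, frameworkDecides, OverloadingCrawler and OverridingCrawler are hypothetical stand-ins, and the endsWith(".pdf") check is a simplified filter rather than the pattern from the question. The sketch assumes a base class whose hook takes only the URL; which signature your Crawler4j version actually expects should be checked against its API.

```java
public class OverrideDemo {

    // Hypothetical stand-in for a framework base class whose hook takes only the URL.
    public static class BaseCrawler {
        // Default behaviour: accept every URL.
        public boolean shouldVisit(String url) {
            return true;
        }
    }

    // Hypothetical stand-in for the framework's scheduler.
    // It only ever calls the one-argument hook declared on the base class.
    public static boolean frameworkDecides(BaseCrawler crawler, String url) {
        return crawler.shouldVisit(url);
    }

    // The extra parameter makes this an overload, not an override,
    // so the framework never calls it and the permissive default stays in effect.
    public static class OverloadingCrawler extends BaseCrawler {
        // Putting @Override on this method would be a compile-time error,
        // which is exactly how the mistake would be caught.
        public boolean shouldVisit(Object page, String url) {
            return url.toLowerCase().endsWith(".pdf");
        }
    }

    // Same filter, but with the signature the base class declares.
    public static class OverridingCrawler extends BaseCrawler {
        @Override
        public boolean shouldVisit(String url) {
            return url.toLowerCase().endsWith(".pdf");
        }
    }

    public static void main(String[] args) {
        String png = "http://www.example.com/image.png";
        // The overload is ignored, so the image URL is accepted.
        System.out.println(frameworkDecides(new OverloadingCrawler(), png)); // prints "true"
        // The real override is used, so the image URL is rejected.
        System.out.println(frameworkDecides(new OverridingCrawler(), png));  // prints "false"
    }
}
```

With the mismatched signature, every URL falls through to the default and gets visited, which would match the output in the question if the same thing is happening there.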

Clarification needed on the shouldVisit and visit methods in Crawler4j
