How do I scrape with crawler4j?

Asked: 2014-11-07 11:14:42

Tags: java windows crawler4j

I have been at this for four hours now, and I cannot see what I am doing wrong. I have two files:

  1. MyCrawler.java
  2. Controller.java

    MyCrawler.java

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;
    import java.util.List;
    import java.util.regex.Pattern;
    import org.apache.http.Header;
    
    public class MyCrawler extends WebCrawler {
    
        private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4"
                        + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
    
        /**
         * You should implement this function to specify whether the given url
         * should be crawled or not (based on your crawling logic).
         */
        @Override
        public boolean shouldVisit(WebURL url) {
                String href = url.getURL().toLowerCase();
                return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
        }
    
        /**
         * This function is called when a page is fetched and ready to be processed
         * by your program.
         */
        @Override
        public void visit(Page page) {
                int docid = page.getWebURL().getDocid();
                String url = page.getWebURL().getURL();
                String domain = page.getWebURL().getDomain();
                String path = page.getWebURL().getPath();
                String subDomain = page.getWebURL().getSubDomain();
                String parentUrl = page.getWebURL().getParentUrl();
                String anchor = page.getWebURL().getAnchor();
    
                System.out.println("Docid: " + docid);
                System.out.println("URL: " + url);
                System.out.println("Domain: '" + domain + "'");
                System.out.println("Sub-domain: '" + subDomain + "'");
                System.out.println("Path: '" + path + "'");
                System.out.println("Parent page: " + parentUrl);
                System.out.println("Anchor text: " + anchor);
    
                if (page.getParseData() instanceof HtmlParseData) {
                        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                        String text = htmlParseData.getText();
                        String html = htmlParseData.getHtml();
                        List<WebURL> links = htmlParseData.getOutgoingUrls();
    
                        System.out.println("Text length: " + text.length());
                        System.out.println("Html length: " + html.length());
                        System.out.println("Number of outgoing links: " + links.size());
                }
    
                Header[] responseHeaders = page.getFetchResponseHeaders();
                if (responseHeaders != null) {
                        System.out.println("Response headers:");
                        for (Header header : responseHeaders) {
                                System.out.println("\t" + header.getName() + ": " + header.getValue());
                        }
                }
    
                System.out.println("=============");
        }
    }
    

    Controller.java

    package edu.crawler;
    
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;
    import java.util.List;
    import java.util.regex.Pattern;
    
    import org.apache.http.Header;
    
    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    
    public class Controller 
    {
    
        public static void main(String[] args) throws Exception 
        {
                String crawlStorageFolder = "../data/";
                int numberOfCrawlers = 7;
    
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder(crawlStorageFolder);
    
                /*
                 * Instantiate the controller for this crawl.
                 */
                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
                RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
    
                /*
                 * For each crawl, you need to add some seed urls. These are the first
                 * URLs that are fetched and then the crawler starts following links
                 * which are found in these pages
                 */
                controller.addSeed("http://www.ics.uci.edu/~welling/");
                controller.addSeed("http://www.ics.uci.edu/~lopes/");
                controller.addSeed("http://www.ics.uci.edu/");
    
                /*
                 * Start the crawl. This is a blocking operation, meaning that your code
                 * will reach the line after this only when crawling is finished.
                 */
                controller.start(MyCrawler, numberOfCrawlers);
        }
    }
    

The structure is as follows:

    java/MyCrawler.java
    java/Controller.java
    jars/... --> all the crawler4j jars
    

I am trying to compile this on a Windows machine using:

    javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" MyCrawler.java
    

That works fine, and I end up with:

    java/MyCrawler.class
    

However, when I type:

    javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;C:\xampp\htdocs\crawlcrowd\www\java\*" Controller.java
    
it blows up:

    Controller.java:50: error: cannot find symbol
                controller.start(MyCrawler, numberOfCrawlers);
                                 ^
      symbol:   variable MyCrawler
      location: class Controller
    1 error
    

So I figure I am somehow not doing something that I need to be doing, something that will make this new executable class "aware" of MyCrawler.class. I have tried fiddling with the classpath in the javac command line. I have also tried setting it in my environment variables... no luck.

Any idea how I can get this to work?

UPDATE

I got most of this code from the Google Code page itself, but I cannot figure out what has to go where. Even if I try this:

    MyCrawler mc = new MyCrawler();
    

No luck. Somehow, Controller.class does not know about MyCrawler.class.

UPDATE 2

I do not think this matters much, since the problem is clearly that it cannot find the class, but either way, here is the signature of CrawlController's start method, taken from here:

    /**
         * Start the crawling session and wait for it to finish.
         * 
         * @param _c
         *            the class that implements the logic for crawler threads
         * @param numberOfCrawlers
         *            the number of concurrent threads that will be contributing in
         *            this crawling session.
         */
        public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
                this.start(_c, numberOfCrawlers, true);
        }
    

I am in fact passing in a "crawler", because I am passing in "MyCrawler". The problem is that the application does not know what MyCrawler is.

3 Answers:

Answer 0 (score: 1):

A few things come to mind:

  1. Does your MyCrawler extend edu.uci.ics.crawler4j.crawler.WebCrawler?

    public class MyCrawler extends WebCrawler
    
  2. Are you passing MyCrawler.class (i.e., as a class) into controller.start?

    controller.start(MyCrawler.class, numberOfCrawlers);
    
  3. Both of these requirements must be met for the controller to compile and run. In addition, crawler4j has some good examples here:

    https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawler.java

    https://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/basic/BasicCrawlController.java

    These two classes will compile and run right away (that is, BasicCrawlController), so they make a good starting point if you are running into any issues. A compile-and-run sketch follows this list.
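
As a side note on the compile step itself: passing both source files to javac in a single invocation lets them resolve each other. Below is a minimal sketch using the asker's paths, with two stated assumptions: both classes live in the same package (the posted Controller.java declares `package edu.crawler;` while MyCrawler.java declares none, and a class in a named package cannot reference a default-package class), and the class directory is listed separately on the classpath, because a `\*` classpath entry only matches .jar files, not directories of .class files:

    cd C:\xampp\htdocs\crawlcrowd\www\java

    REM Compile both sources together; the wildcard picks up the crawler4j jars,
    REM and "." puts the current directory (where the .class files land) on the classpath.
    javac -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;." MyCrawler.java Controller.java

    REM Run it (assuming both classes end up in the default package):
    java -cp "C:\xampp\htdocs\crawlcrowd\www\java\jars\*;." Controller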

Answer 1 (score: 0):

The arguments to start() should be the crawler class and the number of crawlers. An error is thrown when you pass in a crawler object instead of the crawler class. Use the start method as shown below and it should work:

    controller.start(MyCrawler.class, numberOfCrawlers);
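
To spell out why the compiler rejects the original line: a bare identifier in an argument position is resolved as a variable name, whereas the class literal produces the `Class<MyCrawler>` object that the generic `start(Class<T>, int)` signature quoted in the question expects. A minimal contrast:

    // A bare identifier is looked up as a variable, and no variable
    // named MyCrawler exists in Controller.main:
    controller.start(MyCrawler, numberOfCrawlers);        // error: cannot find symbol

    // The class literal supplies the Class<MyCrawler> that start(Class<T>, int)
    // expects; crawler4j then instantiates the crawler threads itself.
    controller.start(MyCrawler.class, numberOfCrawlers);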

Answer 2 (score: -1):

Here you are passing the class name MyCrawler as a parameter:

    controller.start(MyCrawler, numberOfCrawlers);

I do not think the bare class name should be the parameter; start() expects the Class object (MyCrawler.class).

I am also doing a little work with crawling!