IOException when crawling with Nutch

Date: 2018-04-17 08:06:37

Tags: java web-crawler nutch ioexception

I have been trying to crawl with Nutch, using its Java API with the following configuration -

Configuration configuration = new Configuration();
configuration.set("plugin.folders", "data/plugins");
configuration.set("hadoop.tmp.dir","data/tmp");
configuration.set("db.ignore.external.links","false");
configuration.set("db.fetch.schedule.class","org.apache.nutch.crawl.DefaultFetchSchedule");
configuration.set("db.url.normalizers","false");
configuration.set("db.url.filters","false");
configuration.set("db.ignore.external.links.mode","byDomain");
configuration.set("db.fetch.retry.max","3");
configuration.set("db.ignore.external.exemptions.file","data/conf/db-ignore-external-exemptions.txt");
configuration.set("plugin.includes","urlnormalizer-(ajax|basic|host|pass|protocol|querystring|regex|slash)|nutch-extensionpoints");
configuration.set("plugin.auto-activation", "true");
configuration.set("fetcher.follow.outlinks.depth", "10");
configuration.set("fetcher.follow.outlinks.num.links", "100");
configuration.set("http.agent.name", "Nutch Crawler");
configuration.set("fetcher.max.crawl.delay", "15000");
configuration.set("http.timeout", String.valueOf(25000));
configuration.set("http.content.limit", String.valueOf(1024 * 1024 * 512));

My code is -

public void crawlWithNutch() throws Exception{

    CrawlerConfiguration crawlerConf = new CrawlerConfiguration();
    Path crawlDB = crawlerConf.crawlDB();
    Path urlDir = crawlerConf.urlDir();

    System.out.println("Injecting urls to web table");
    Injector injector = crawlerConf.injector();
    injector.inject(crawlDB, urlDir);

    Path segmentsDir = crawlerConf.segmentsDir();
    Generator generator = crawlerConf.generator();
    Fetcher fetcher = crawlerConf.fetcher();
    ParseSegment parseSegment = crawlerConf.parseSegment();
    CrawlDb crawlDb = crawlerConf.crawlDb();
    FileDumper dumper = crawlerConf.dumper();

    long segmentTime = new Date().getTime();

    boolean stop = false;

    int numberOfReduceTasks = 8;
    int numberOfPagesInrun = 100;
    int numberOfCores = 2;

    while(!stop) {

        System.out.println("Generator phase");
        Path[] paths = generator.generate(crawlDB, segmentsDir, numberOfReduceTasks, numberOfPagesInrun, segmentTime);
        if(paths!=null && paths.length>0) {

            for (Path path : paths) {
                System.out.println("Fetch: " + path);
                fetcher.fetch(path, numberOfCores);
                System.out.println("Parse: " + path);
                parseSegment.parse(path);
            }
            System.out.println("Update: " + paths);
            crawlDb.update(crawlDB, paths, true, true);

        } else {
            stop = true;
        }
    }

    dumper.dump(new java.io.File("data/crawl/dump"),
            new java.io.File("data/crawl/segments"),
            tokenize("json"), true, true, false);


}
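
For context, the CrawlerConfiguration helper used above is not shown; below is a minimal sketch of what it is assumed to do (the class body, paths and factory methods are assumptions, not the real implementation) - it simply hands the Configuration from the first snippet to the standard Nutch 1.x components.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.tools.FileDumper;

// Hypothetical sketch of CrawlerConfiguration: every Nutch component is
// constructed with the same Configuration instance so that plugin.folders,
// plugin.includes etc. are visible to the injector, generator, fetcher,
// parser and crawldb updater alike.
public class CrawlerConfiguration {

    // built exactly as in the first snippet above
    private final Configuration configuration = createConfiguration();

    private Configuration createConfiguration() {
        Configuration conf = new Configuration();
        conf.set("plugin.folders", "data/plugins");
        // ... remaining properties from the first snippet ...
        return conf;
    }

    // assumed working-directory layout
    public Path crawlDB()     { return new Path("data/crawl/crawldb"); }
    public Path urlDir()      { return new Path("data/crawl/urls"); }
    public Path segmentsDir() { return new Path("data/crawl/segments"); }

    public Injector injector()         { return new Injector(configuration); }
    public Generator generator()       { return new Generator(configuration); }
    public Fetcher fetcher()           { return new Fetcher(configuration); }
    public ParseSegment parseSegment() { return new ParseSegment(configuration); }
    public CrawlDb crawlDb()           { return new CrawlDb(configuration); }
    public FileDumper dumper()         { return new FileDumper(); }
}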

It throws an exception during the fetch phase. The exception is -

13:34:00.564 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.nutch.util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/home/ishan/IdeaProjects/crawler/data/tmp/mapred/local/localRunner/ishan/job_local1687861686_0004/job_local1687861686_0004.xml, instantiating a new object cache
13:34:00.570 [LocalJobRunner Map Task Executor #0] WARN org.apache.nutch.parse.ParsePluginsReader - Unable to parse [null].Reason is [java.net.MalformedURLException]
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
    at au.com.sensis.WebCrawler.crawlWithNutch(WebCrawler.java:217)
    at au.com.sensis.WebCrawler.main(WebCrawler.java:46)

I have tried everything I can, but I still cannot get it to work. Any help would be appreciated. I think it is setting the URL to "null" and therefore throwing this exception.
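
For what it is worth, the ParsePluginsReader warning ("Unable to parse [null] ... MalformedURLException") usually means that Nutch's own resource files (nutch-default.xml, nutch-site.xml, parse-plugins.xml) are not being picked up, which can happen when a plain Hadoop Configuration is used instead of Nutch's loader. A minimal sketch of the alternative, assuming those files are on the classpath -

// Hedged sketch: build the configuration via Nutch's own factory so that
// nutch-default.xml and nutch-site.xml are loaded, then apply the custom
// overrides from above on top of it.
Configuration configuration = org.apache.nutch.util.NutchConfiguration.create();
configuration.set("plugin.folders", "data/plugins");
configuration.set("http.agent.name", "Nutch Crawler");
// ... remaining overrides from the first snippet ...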

0 answers:

No answers yet