I have been trying to crawl with Nutch. I am using their Java API with the following configuration -
Configuration configuration = new Configuration();
configuration.set("plugin.folders", "data/plugins");
configuration.set("hadoop.tmp.dir","data/tmp");
configuration.set("db.ignore.external.links","false");
configuration.set("db.fetch.schedule.class","org.apache.nutch.crawl.DefaultFetchSchedule");
configuration.set("db.url.normalizers","false");
configuration.set("db.url.filters","false");
configuration.set("db.ignore.external.links.mode","byDomain");
configuration.set("db.fetch.retry.max","3");
configuration.set("db.ignore.external.exemptions.file","data/conf/db-ignore-external-exemptions.txt");
configuration.set("plugin.includes","urlnormalizer-(ajax|basic|host|pass|protocol|querystring|regex|slash)|nutch-extensionpoints");
configuration.set("plugin.auto-activation", "true");
configuration.set("fetcher.follow.outlinks.depth", "10");
configuration.set("fetcher.follow.outlinks.num.links", "100");
configuration.set("http.agent.name", "Nutch Crawler");
configuration.set("fetcher.max.crawl.delay", "15000");
configuration.set("http.timeout", String.valueOf(25000));
configuration.set("http.content.limit", String.valueOf(1024 * 1024 * 512));
My code is -
public void crawlWithNutch() throws Exception {
    CrawlerConfiguration crawlerConf = new CrawlerConfiguration();
    Path crawlDB = crawlerConf.crawlDB();
    Path urlDir = crawlerConf.urlDir();
    System.out.println("Injecting urls to web table");
    Injector injector = crawlerConf.injector();
    injector.inject(crawlDB, urlDir);
    Path segmentsDir = crawlerConf.segmentsDir();
    Generator generator = crawlerConf.generator();
    Fetcher fetcher = crawlerConf.fetcher();
    ParseSegment parseSegment = crawlerConf.parseSegment();
    CrawlDb crawlDb = crawlerConf.crawlDb();
    FileDumper dumper = crawlerConf.dumper();
    long segmentTime = new Date().getTime();
    boolean stop = false;
    int numberOfReduceTasks = 8;
    int numberOfPagesInrun = 100;
    int numberOfCores = 2;
    while (!stop) {
        System.out.println("Generator phase");
        Path[] paths = generator.generate(crawlDB, segmentsDir, numberOfReduceTasks, numberOfPagesInrun, segmentTime);
        if (paths != null && paths.length > 0) {
            for (Path path : paths) {
                System.out.println("Fetch: " + path);
                fetcher.fetch(path, numberOfCores);
                System.out.println("Parse: " + path);
                parseSegment.parse(path);
            }
            System.out.println("Update: " + java.util.Arrays.toString(paths));
            crawlDb.update(crawlDB, paths, true, true);
        } else {
            stop = true;
        }
    }
    dumper.dump(new java.io.File("data/crawl/dump"),
            new java.io.File("data/crawl/segments"),
            tokenize("json"), true, true, false);
}
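For completeness: urlDir points at a directory holding a plain-text seed list with one URL per line (the file name and URL below are placeholders), and crawlWithNutch() is invoked from a bare main method, which is where the WebCrawler.main frame in the stack trace comes from. Roughly:

// data/urls/seed.txt -- one URL per line, e.g.
// https://example.com/

public static void main(String[] args) throws Exception {
    new WebCrawler().crawlWithNutch();
}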
It throws an exception in the fetch phase. The exception is -
13:34:00.564 [LocalJobRunner Map Task Executor #0] DEBUG org.apache.nutch.util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/home/ishan/IdeaProjects/crawler/data/tmp/mapred/local/localRunner/ishan/job_local1687861686_0004/job_local1687861686_0004.xml, instantiating a new object cache
13:34:00.570 [LocalJobRunner Map Task Executor #0] WARN org.apache.nutch.parse.ParsePluginsReader - Unable to parse [null].Reason is [java.net.MalformedURLException]
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:486)
at au.com.sensis.WebCrawler.crawlWithNutch(WebCrawler.java:217)
at au.com.sensis.WebCrawler.main(WebCrawler.java:46)
I have tried everything I could think of, but I still cannot get it to work. Any help would be greatly appreciated. I think it is ending up with a "null" URL, and that is why this exception is thrown.