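I'm using Nutch 1.6 to crawl some forums and Solr 1.6.2 to index them. I ran a test query against Solr and was surprised to get only a handful of results. I suspected a problem with either Nutch's page parsing or Solr's indexing. After poking around, I found that Nutch has not parsed many of the pages it fetched: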
bin/nutch readseg -list -dir crawl-mothering2/segments/
NAME            GENERATED  FETCHED  PARSED
20130228001531  23         27       9
20130228003940  1430       1434     661
20130228001829  202        206      105
20130228061337  1068       1090     475
20130228091009  1          2        0
20130228085956  34         34       25
20130228090348  44         45       34
20130228090851  7          7        6
20130228080438  364        374      192
20130228030933  1774       1795     903
20130228084205  168        169      63
But when I try to parse the segments, I get this:
bin/nutch parse crawl-mothering2/segments/*
ParseSegment: starting at 2013-03-21 00:20:43
ParseSegment: segment: crawl-mothering2/segments/20130228001531
Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
What gives?
Answer 0 (score: 1)
If you want to re-parse, go into the segment's directory under crawl/segments/ and run
rm -rf parse_text parse_data crawl_parse
Then you can run
bin/nutch parse crawldir/segments/<segmentnumber>
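To do the same for every segment in one go, a loop along these lines might help. It is only a rough sketch: the function name reparse_segments and the NUTCH variable are made up for illustration and are not part of Nutch, and it assumes you run it from the Nutch install directory so that bin/nutch resolves. It parses one segment per call, as in the command above.

```shell
#!/bin/sh
# Sketch: clear earlier parse output from each segment, then parse it again.
# reparse_segments and NUTCH are illustrative names, not Nutch settings.

reparse_segments() {
    segments_dir="$1"
    for segment in "$segments_dir"/*/; do
        # Skip the unexpanded pattern when there are no segment directories.
        [ -d "$segment" ] || continue
        segment="${segment%/}"
        # The three directories named in the answer above.
        rm -rf "$segment/parse_text" "$segment/parse_data" "$segment/crawl_parse"
        # Parse one segment per invocation.
        ${NUTCH:-bin/nutch} parse "$segment"
    done
}
```

For the crawl in the question you would call it as reparse_segments crawl-mothering2/segments. Since rm -rf is not reversible, it is probably worth copying the segments directory somewhere first.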
Answer 1 (score: 0)
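Nutch will not re-parse a segment that already contains parse output. To fix this, you need to delete a few folders from the segment. See the mailing list discussion at http://www.mail-archive.com/user@nutch.apache.org/msg09017.html.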
You may also get faster replies there.