Nutch 1.14 deduplication fails

Time: 2018-01-09 12:21:30

Tags: hadoop solr web-crawler bigdata nutch

I have integrated Nutch 1.14 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I provided about 10 URLs in the seed list at /usr/local/apache-nutch-1.13/urls/seed.txt and followed the tutorial.

[root@localhost apache-nutch-1.14]# bin/nutch dedup http://ip:8983/solr/
DeduplicationJob: starting at 2018-01-09 15:07:52
DeduplicationJob: java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:329)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:320)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:561)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:870)
    at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:326)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:369)

All of the Solr-related commands work. Please help. Where is the Hadoop element they talk about in the Nutch tutorial? Do we have to install anything besides Java for Hadoop, Nutch, and Solr to work together as a search engine?

2 Answers:

Answer 0 (score: 0)

Try this:

bin/nutch dedup -Dsolr.server.url=http://ip:8983/solr/
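
The "No FileSystem for scheme: http" error most likely means Hadoop tried to open the Solr URL as an input path: in Nutch 1.14 the dedup job reads the crawldb, so its first argument should be a crawldb directory, not the Solr address. A minimal sketch, assuming your crawl data lives under crawl/ (adjust the paths and URL to your setup):

# Mark duplicates in the crawldb; the Solr URL is passed as a -D property,
# never as the positional argument.
bin/nutch dedup crawl/crawldb/ -Dsolr.server.url=http://ip:8983/solr/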

Answer 1 (score: 0)

I was reading the same guide and ran into the same problem. This might help:

(Step-by-Step: Deleting Duplicates)  
$ bin/nutch dedup crawl/crawldb/ -Dsolr.server.url=http://localhost:8983/solr/nutch

DeduplicationJob: starting at 2018-02-23 14:27:34  
Deduplication: 1 documents marked as duplicates  
Deduplication: Updating status of duplicate urls into crawl db.  
Deduplication finished at 2018-02-23 14:27:37, elapsed: 00:00:03
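
Note that dedup only marks duplicates in the crawldb; it does not remove them from Solr by itself. A hedged sketch of the usual follow-up step, assuming the same crawldb path and a Solr core named nutch (adjust to your installation):

# Push the duplicate/gone markers to Solr so the marked documents are deleted from the index
bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/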