Nutch 1.11 crawl problem

Date: 2016-01-19 11:46:39

Tags: solr nutch

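I configured Nutch by following the tutorial. I run it under Cygwin on Windows 7 and use Solr 5.4.0 to index the data.

However, Nutch 1.11 runs into a problem when executing the crawl.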

Crawl command:

$ bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr /urls /TestCrawl 2

Error/exception:

Injecting seed URLs
/apache-nutch-1.11/bin/nutch inject /TestCrawl/crawldb /urls
Injector: starting at 2016-01-19 17:11:06
Injector: crawlDb: /TestCrawl/crawldb
Injector: urlDir: /urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
    at org.apache.nutch.crawl.Injector.run(Injector.java:379)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:369)

Error running:
/home/apache-nutch-1.11/bin/nutch inject /TestCrawl/crawldb /urls
Failed with exit value 127.

2 answers:

Answer 0: (score: 1)

I can see several problems with your command. Try this instead:

bin/crawl -i -Dsolr.server.url=http://127.0.0.1:8983/solr/core_name path_to_seed crawl 2

The first problem is the space when passing the Solr parameter. The second is that the Solr URL should also include the core name.
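
As a rough illustration of the two points in this answer, the sketch below assembles the command in a small bash script and prints it rather than running it. The helper name `build_crawl_cmd` and all variable values (including `core_name`) are placeholders for illustration, not values from a real setup; substitute your own core name and directories.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions for illustration only).
SOLR_URL="http://127.0.0.1:8983/solr"
CORE_NAME="core_name"   # hypothetical Solr core name; replace with your own
SEED_DIR="urls"         # directory holding the seed URL list
CRAWL_DIR="TestCrawl"   # directory where crawl data is written
ROUNDS=2                # number of crawl rounds

# Hypothetical helper: prints the crawl command without executing it.
build_crawl_cmd() {
  # -D and the property are kept in one token, and the core name is
  # appended to the Solr URL, as the answer suggests.
  local args=(-i "-Dsolr.server.url=${SOLR_URL}/${CORE_NAME}" "$SEED_DIR" "$CRAWL_DIR" "$ROUNDS")
  printf 'bin/crawl'
  printf ' %s' "${args[@]}"
  printf '\n'
}

build_crawl_cmd
```

Printing the command first makes it easy to check for stray spaces before running it from the Nutch installation directory.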

Answer 1: (score: 0)

When using Nutch, you need the [...] jar file.