使用Nutch进行爬网...显示IOException

时间:2012-06-22 22:27:11

标签: java open-source web-crawler nutch ioexception

我开始使用Nutch,一切都很好,直到遇到IOException异常,

$ ./nutch crawl urls -dir myCrawl -depth 2 -topN 4
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: myCrawl
rootUrlDir = urls
threads = 10
depth = 2
solrUrl=null
topN = 4
Injector: starting at 2012-06-23 03:37:51
Injector: crawlDb: myCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Rahul\mapred\staging\Rahul255889423\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
    at      org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

@jeffery ---我降级了我的nutch版本n遇到了一个新问题,这超出了我的理解范围.... Plzz帮助......

$ ./nutch crawl urls -dir myCrawl -depth 4 -topN 5
cygpath: can't convert empty path
solrUrl is not set, indexing will be skipped...
crawl started in: myCrawl
root UrlDir = urls
threads = 10
depth = 4
solrUrl=null
topN = 5
Injector: starting at 2012-06-23 22:30:28
Injector: crawlDb: myCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

这个tym有什么问题?

1 个答案:

答案 0 :(得分:0)

我也在几天前遇到过这个问题。较新版本的Hadoop在与Windows交互时遇到了麻烦。您可以切换到* nix平台(您可能应该这样做,几乎所有对Nutch的支持都针对* nix用户)或降级您的Nutch版本。我发现在Windows Server 2008上运行的最新版Nutch是1.2