Crawling problem with Apache Nutch 1.12

Time: 2017-02-21 08:49:19

Tags: apache solr web-crawler nutch

I am new to Nutch. I am following https://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website to run a crawl with Nutch 1.12. I have set it up on Windows using Cygwin.

The bin/nutch command runs fine, but in order to crawl I made the following changes:

  1. This is my conf/nutch-site.xml file

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
    	<property>
    	 <name>http.agent.name</name>
    	 <value>My Nutch Spider</value>
    	</property>
    </configuration>

  2. This is the content of the urls/seed.txt file I created

      https://www.drugs.com/

  3. Now I run the command bin/nutch inject crawl/crawldb urls and I get a NullPointerException, as shown below

      MithL@DESKTOP-K3INBH0 /home/apache-nutch-1.12
      $ bin/nutch inject crawl/crawldb urls
      Injector: starting at 2017-02-21 14:03:51
      Injector: crawlDb: crawl/crawldb
      Injector: urlDir: urls
      Injector: Converting injected urls to crawl db entries.
      Injector: java.lang.NullPointerException
              at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
              at org.apache.hadoop.util.Shell.run(Shell.java:418)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
              at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
              at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
              at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
              at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
              at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
              at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
              at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
              at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
              at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
              at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
              at org.apache.nutch.crawl.Injector.run(Injector.java:467)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
              at org.apache.nutch.crawl.Injector.main(Injector.java:441)
      

Please suggest what I should do. Thanks
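
For reference, the steps described above (a one-URL seed list followed by the inject step) can be sketched as a short shell session. This is only a sketch under assumptions: the function names `prepare_seed` and `run_inject` are hypothetical helpers, and the inject command is the one quoted in the question, run from the Nutch install directory.

```shell
# Write a one-URL seed list into DIR/seed.txt (DIR defaults to "urls").
prepare_seed() {
  seed_dir="${1:-urls}"
  mkdir -p "$seed_dir"
  printf '%s\n' 'https://www.drugs.com/' > "$seed_dir/seed.txt"
}

# Run the inject step from the question, but only when bin/nutch is
# present and executable in the current directory.
run_inject() {
  if [ -x bin/nutch ]; then
    bin/nutch inject crawl/crawldb urls
  else
    echo "bin/nutch not found; run this from the Nutch install directory" >&2
    return 1
  fi
}
```

Typical use would be `prepare_seed && run_inject` from the apache-nutch-1.12 directory.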

!UPDATE!

I added hadoop-core-1.2.1.jar to the apache-nutch-1.12/lib folder and set the HADOOP_HOME environment variable to C:\winutils\bin\winutils.exe
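
As a side note on the HADOOP_HOME value: on Windows, Hadoop looks for winutils.exe in the bin subfolder of HADOOP_HOME, so the variable is expected to name a directory rather than the executable itself. With the layout mentioned here, that would presumably be C:\winutils. A minimal sketch of a check, assuming a POSIX-style shell such as Cygwin's; the function name `check_hadoop_home` is hypothetical.

```shell
# Return 0 when the given path looks like a usable HADOOP_HOME on Windows:
# a directory that contains bin/winutils.exe. Otherwise explain why not.
check_hadoop_home() {
  home_dir="$1"
  if [ -z "$home_dir" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$home_dir" ]; then
    echo "HADOOP_HOME is not a directory: $home_dir" >&2
    return 1
  fi
  if [ ! -f "$home_dir/bin/winutils.exe" ]; then
    echo "bin/winutils.exe not found under: $home_dir" >&2
    return 1
  fi
  return 0
}
```

Under Cygwin the Windows-style value can be converted first, for example `check_hadoop_home "$(cygpath -u "$HADOOP_HOME")"`.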

Now it gives an UnsupportedOperationException, as shown below

      $ bin/nutch inject crawl/crawldb urls
      Injector: starting at 2017-02-21 21:37:32
      Injector: crawlDb: crawl/crawldb
      Injector: urlDir: urls
      Injector: Converting injected urls to crawl db entries.
      Injector: java.lang.UnsupportedOperationException: Not implemented by the DistributedFileSystem FileSystem implementation
              at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:214)
              at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2365)
              at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
              at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
              at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
              at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
              at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
              at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
              at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
              at org.apache.nutch.crawl.Injector.inject(Injector.java:347)
              at org.apache.nutch.crawl.Injector.run(Injector.java:467)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
              at org.apache.nutch.crawl.Injector.main(Injector.java:441)
      

Please suggest what I should do. Thanks
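
An error of this kind ("Not implemented by the DistributedFileSystem FileSystem implementation") is commonly reported when jars from different Hadoop major versions end up on the same classpath, which may be the case here after adding hadoop-core-1.2.1.jar. The sketch below lists the Hadoop jars in a lib folder and flags that mix. It assumes the Nutch lib folder ships Hadoop 2.x jars named hadoop-common-*.jar, which is not confirmed in this question; the function name `check_hadoop_jars` is hypothetical.

```shell
# Print the Hadoop jars found in a lib directory (default: "lib") and
# return 1 when a Hadoop 1.x hadoop-core jar sits next to hadoop-common jars.
check_hadoop_jars() {
  lib_dir="${1:-lib}"
  core_count=0
  common_count=0
  for jar in "$lib_dir"/hadoop-*.jar; do
    # Skip the unexpanded pattern when no jar matches.
    [ -e "$jar" ] || continue
    jar_name="$(basename "$jar")"
    echo "found: $jar_name"
    case "$jar_name" in
      hadoop-core-*)   core_count=$((core_count + 1)) ;;
      hadoop-common-*) common_count=$((common_count + 1)) ;;
    esac
  done
  if [ "$core_count" -gt 0 ] && [ "$common_count" -gt 0 ]; then
    echo "mixed hadoop-core and hadoop-common jars in $lib_dir" >&2
    return 1
  fi
  return 0
}
```

It could be run as `check_hadoop_jars lib` from the apache-nutch-1.12 directory.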

0 answers:

No answers