Nutch on Windows: ERROR crawl.Injector

Date: 2016-07-01 14:02:40

Tags: windows hadoop cygwin nutch

I am trying to install Nutch 1.12 on a Windows 2012 server with cygwin64 2.874. Since my Java and Linux skills are limited, I am working step by step through https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Seeding_the_crawldb_with_a_list_of_URLs. The command

 bin/nutch inject crawl/crawldb urls

throws an error because winutils.exe cannot be found. Here is the hadoop log:

2016-07-01 09:22:25,660 ERROR util.Shell - Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
    at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:432)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:478)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:170)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: starting at 2016-07-01 09:22:25
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: urlDir: urls
2016-07-01 09:22:25,832 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 09:22:26,050 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 09:22:26,394 ERROR crawl.Injector - Injector: java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(Unknown Source)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)

I found several hints here and downloaded winutils.exe from https://codeload.github.com/srccodes/hadoop-common-2.2.0-bin/zip/master. I unpacked the folder on the server and set the environment variable NUTCH_OPTS=-Dhadoop.home.dir=[winutils_folder] (see the sketch after the log below). Now the winutils error is gone, but the nutch call fails with a different error:

2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: starting at 2016-07-01 13:24:09
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: urlDir: urls
2016-07-01 13:24:09,697 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2016-07-01 13:24:09,885 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-07-01 13:24:10,307 ERROR crawl.Injector - Injector: org.apache.hadoop.util.Shell$ExitCodeException: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
    at org.apache.hadoop.util.Shell.run(Shell.java:418)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:849)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1149)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:58)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:357)
    at org.apache.nutch.crawl.Injector.run(Injector.java:467)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:441)
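
For reference, a minimal sketch of that environment variable as set in the Cygwin shell, assuming the winutils archive was unpacked to C:\hadoop (a hypothetical path; winutils.exe must end up under <hadoop.home.dir>\bin):

export NUTCH_OPTS='-Dhadoop.home.dir=C:/hadoop'   # hypothetical path; bin\winutils.exe must exist below it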

After updating .bashrc (adding the lines below), the hadoop log only shows warnings.

export JAVA_HOME='/cygdrive/c/Program Files/Java/jre1.8.0_92'
export PATH=$PATH:$JAVA_HOME/bin

But nutch still fails with an error:

Injector: starting at 2016-07-01 17:25:22
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:570)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:173)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:160)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:94)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
        at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:131)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:376)
        at org.apache.nutch.crawl.Injector.run(Injector.java:467)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.crawl.Injector.main(Injector.java:441)

I need hints on what might be misconfigured, or on whether it is possible to run nutch with Windows/cygwin at all.

2 Answers:

Answer 0 (score: 0)

It took me about a week to figure out what was causing the UnsatisfiedLinkError.

I believe the root cause is that Java could not find its dependencies.

I solved the problem by commenting out part of the script in the nutch file (located in {nutch_folder}\runtime\local\bin).

More specifically, I commented out the lines that set -Djava.library.path, so that the java program uses the original java.library.path instead.

Check the original java.library.path of your environment with the following command:

java -XshowSettings:properties -version

It will print many properties, including java.library.path=...
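
For illustration only, the relevant part of the output might look like this (values are machine-specific; these paths are hypothetical):

Property settings:
    java.home = C:\Program Files\Java\jre1.8.0_92
    java.library.path = C:\Program Files\Java\jre1.8.0_92\bin
        C:\Windows\System32
    ...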

Comment out the following code:

# setup 'java.library.path' for native-hadoop code if necessary
# used only in local mode
JAVA_LIBRARY_PATH=''
if [ -d "${NUTCH_HOME}/lib/native" ]; then

  JAVA_PLATFORM=`"${JAVA}" -classpath "$CLASSPATH" org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`

  if [ -d "${NUTCH_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH="${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
    else
      JAVA_LIBRARY_PATH="${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}"
    fi
  fi
fi

if [ $cygwin = true -a "X${JAVA_LIBRARY_PATH}" != "X" ]; then
   JAVA_LIBRARY_PATH="`cygpath -p -w "$JAVA_LIBRARY_PATH"`"
fi

Also comment out the code below:

if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
  NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
fi

This way NUTCH_OPTS will no longer contain -Djava.library.path, so java will use its original settings.
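
For reference, this is a sketch of what the second block looks like once commented out (the same lines as above, prefixed with #):

# if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
#   NUTCH_OPTS=("${NUTCH_OPTS[@]}" -Djava.library.path="$JAVA_LIBRARY_PATH")
# fi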

The nutch command now runs in (pseudo) distributed mode. I do not know exactly why.

Hope this helps.

Answer 1 (score: 0)

There is no need to comment out anything in the nutch script. If you have placed hadoop.dll and winutils.exe under HADOOP_HOME's bin folder, make sure the path to that bin folder is also appended to the Windows PATH environment variable (a sketch follows below).
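
As a sketch only, assuming the Hadoop binaries were unpacked to C:\hadoop (a hypothetical path), the equivalent setup for the Cygwin session that launches nutch could be added to ~/.bashrc; Cygwin translates PATH for native Windows child processes, so the JVM can then locate hadoop.dll and winutils.exe:

export HADOOP_HOME=/cygdrive/c/hadoop    # hypothetical; must contain bin/hadoop.dll and bin/winutils.exe
export PATH=$PATH:$HADOOP_HOME/bin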