Error when running the wordcount R example code on Hadoop

Asked: 2014-12-10 21:06:24

Tags: r hadoop rhadoop

The R wordcount example code:

library(rmr2)

# Mapper: split each input line on whitespace and emit a (word, 1) pair per word.
map <- function(k, lines) {
    words.list <- strsplit(lines, '\\s')
    words <- unlist(words.list)
    return(keyval(words, 1))
}

# Reducer: sum the counts collected for each word.
reduce <- function(word, counts) {
    keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
    mapreduce(input = input, output = output, input.format = "text",
              map = map, reduce = reduce)
}

# Remove any previous output, then set up the HDFS paths and run the job.
system("/opt/hadoop/hadoop-2.5.1/bin/hadoop fs -rm -r /wordcount/out")
hdfs.root <- 'wordcount'
hdfs.data <- file.path(hdfs.root, 'data')
hdfs.out <- file.path(hdfs.root, 'out')
wordcount(hdfs.data, hdfs.out)  # the last statement -- this is where the error appears

When I execute the last statement of the R code, it gives the following error message.

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

After the error, the job output shows:

INFO mapreduce.Job:  map 100% reduce 100%

ERROR streaming.StreamJob: Job not Successful! Streaming Command Failed!

The output folder is created in HDFS, but no result files are produced. Any idea what could be causing the problem?

Update 1:

I found the error log that Hadoop produced for this particular job at localhost:8042:

Dec 11, 2014 3:26:38 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Dec 11, 2014 3:26:40 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Dec 11, 2014 3:26:40 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Dec 11, 2014 3:26:43 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Dec 11, 2014 3:26:45 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Does anyone know what the problem is?

Update 2:

I found additional log information at $HADOOP_HOME/logs/userlogs/[application_id]/[container_id]/stderr:

...
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
  error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("base", "methods", "datasets", "utils", "grDevices", "graphics",  :
can't load rhdfs
Loading required package: rmr2
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
there is no package called ‘stringr’
...
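
This stderr means the R process started by the streaming task could not load rhdfs because HADOOP_CMD was missing from its environment. As a hedged sketch (the paths follow the /opt/hadoop/hadoop-2.5.1 install from the question; the streaming-jar name is an assumption to verify with `ls`), the environment rhdfs and rmr2 expect before R starts looks like:

```shell
# rhdfs refuses to load unless HADOOP_CMD points at the hadoop binary,
# and rmr2 wants HADOOP_STREAMING pointing at the streaming jar.
# Both paths below are assumptions based on the install in the question;
# confirm the jar's exact name before relying on it.
export HADOOP_CMD=/opt/hadoop/hadoop-2.5.1/bin/hadoop
export HADOOP_STREAMING=/opt/hadoop/hadoop-2.5.1/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
```

Exporting these in the shell that launches R fixes the client side; for the task side, they must also be visible to the user account under which the YARN containers run.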

1 Answer:

Answer 0 (score: 0)

After digging into the error logs, it turned out that I had installed the R libraries at the user level, while I should have installed them at the system level. A detailed explanation of how to install R libraries at the system level can be found in this thread. (The "devtools" package can come in handy; just remember to run R under sudo, or you may prefer `sudo R CMD INSTALL [package_name]`.)

You can double-check a package's installation path from within R via `system.file(package="[package_name]")`, but this always shows only the first preferred library path for the package. So I strongly recommend removing the previously installed user library first.
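
Because checks from inside R only report the first matching library, a quick way to see whether a stale user-level copy still shadows the system one is to look for the package directory under both library roots. The two paths below are assumptions for a typical Linux R 3.1 install; substitute whatever `.libPaths()` prints on your machine:

```shell
# Look for a package under both the per-user and the system R library.
# Both library paths are assumptions; check .libPaths() for the real ones.
pkg="stringr"
for lib in "$HOME/R/x86_64-pc-linux-gnu-library/3.1" /usr/lib64/R/library; do
    if [ -d "$lib/$pkg" ]; then
        echo "$pkg: $lib"
    fi
done
```

If the user-level line prints, delete that copy (or the whole user library) so every account, including the one running the YARN containers, resolves the system copy.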

Run the job a few more times to double-check the error logs and make sure the packages are installed correctly in the system R library. The container `stderr` logs are helpful, but no one had pointed out their actual location before :-(