I am trying to write a data frame from R to HDFS using the rmr package in RStudio on Amazon EMR. The tutorial I am following is http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR
The code I wrote is:
Sys.setenv(HADOOP_CMD="/home/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/hadoop/contrib/streaming/hadoop-streaming.jar")
Sys.setenv(JAVA_HOME="/usr/java/latest/jre")
# load libraries
library(rmr2)
library(rhdfs)
library(plyrmr)
# initiate rhdfs package
hdfs.init()
# a very simple plyrmr example to test the package
library(plyrmr)
# running code locally
bind.cols(mtcars, carb.per.cyl = carb/cyl)
# same code on Hadoop cluster
to.dfs(mtcars, output="/tmp/mtcars")
I am also following this code example: https://github.com/awslabs/emr-bootstrap-actions/blob/master/R/Hadoop/examples/biganalyses_example.R
The Hadoop version is Cloudera CDH5, and I have set the environment variables accordingly.
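One way to double-check that the variables point at real files is to test the paths from a shell before starting R. This is a sketch using the paths from my Sys.setenv calls above; adjust them to your own install:

```shell
# Verify the hadoop binary and the streaming jar actually exist at the
# paths exported to R (paths taken from the code above; yours may differ).
test -x /home/hadoop/bin/hadoop \
  && echo "hadoop binary found" \
  || echo "hadoop binary MISSING"
test -f /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  && echo "streaming jar found" \
  || echo "streaming jar MISSING"
```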
When running the above code, I get the following error:
> to.dfs(data,output="/tmp/cust_seg")
15/03/09 20:00:21 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-verbose
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
Streaming Job Failed!
I cannot figure out a solution to this problem. Any quick help would be much appreciated.
Answer 0 (score: 0)
The error occurs because the HADOOP_STREAMING environment variable is not set correctly in your code. You should specify the full path to the jar, including the jar file name itself. The R code below works fine for me.
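If you are not sure where the streaming jar lives on your distribution, a find over the Hadoop install directory usually turns it up. A minimal sketch, assuming a /usr/local/hadoop install root (CDH5 typically keeps the jar under /usr/lib/hadoop-mapreduce instead, so adjust the search root to your layout):

```shell
# Locate the streaming jar; on most distributions its name matches
# hadoop-streaming*.jar. Use the printed path for HADOOP_STREAMING.
find /usr/local/hadoop -name 'hadoop-streaming*.jar' 2>/dev/null
```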
R code (I am using Hadoop 2.4.0):
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")
# load libraries
library(rmr2)
library(rhdfs)
# initiate rhdfs package
hdfs.init()
# a very simple plyrmr example to test the package
library(plyrmr)
# running code locally
bind.cols(mtcars, carb.per.cyl = carb/cyl)
# same code on Hadoop cluster
to.dfs(mtcars, output="/tmp/mtcars")
# list the files of tmp folder
hdfs.ls("/tmp")
permission owner group size modtime file
1 -rw-r--r-- manohar supergroup 1685 2015-03-22 16:12 /tmp/mtcars
Hope this helps.