Mongo-Hadoop Streaming

Date: 2015-04-10 04:58:38

Tags: mongodb hadoop hadoop-streaming

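I am new to MongoDB and Hadoop. I am trying to use MongoDB data as the input to a Hadoop MapReduce job, but I am not sure how to specify the collection to read the data from. This is what I tried: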

hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar 
-input user/test/input/
-output user/test/output/
-inputformat com.mongodb.hadoop.mapred.MongoInputFormat
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat
-io mongodb
-D mongo.input.uri=mongodb://localhost/my_dbs.collectionName 
-D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver 
-mapper /Users/wordcountMapper.py 
-reducer /Users/wordcountReducer.py 
-libjars /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/mongo-hadoop-streaming.jar

But I get the following error:

ERROR streaming.StreamJob: Unrecognized option: -D
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]

When I try the following instead, I get a different error:

 hadoop jar /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar 
-input user/input/ 
-output user/test/output 
-inputformat com.mongodb.hadoop.mapred.MongoInputFormat 
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat 
-io mongodb -jobconf mongo.input.uri=mongodb://localhost/my_dbs.collectionName 
-jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver 
-mapper /Users/wordcountMapper.py 
-reducer /Users/wordcountReducer.py 
-libjars /usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib/mongo-hadoop-streaming.jar

`ERROR streaming.StreamJob: Unrecognized option: -libjars
Usage: $HADOOP_PREFIX/bin/hadoop jar hadoop-streaming.jar [options]`

Please help.

1 Answer:

Answer 0 (score: 1)

Please see this link for how to connect MongoDB to Hadoop.

Edit:

Alternatively, instead of passing the jar with the -libjars option on the command line, you can add it directly in the driver as:

args.add("-libjars");
args.add("/some/path/to/your/jar");
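
Both "Unrecognized option" errors in the question are consistent with how Hadoop Streaming parses its command line: generic options such as -D and -libjars have to appear before the streaming-specific options (-input, -mapper, and so on). The following is a minimal Python sketch that reorders the question's arguments accordingly. The helper name order_streaming_args is hypothetical, and the sketch assumes every option is a flag followed by exactly one value. It only addresses option ordering; whether the job then runs against MongoDB has not been verified here.

```python
import shlex

# Generic Hadoop options; the streaming CLI expects these before its own options.
GENERIC_FLAGS = {"-D", "-conf", "-fs", "-jt", "-files", "-libjars", "-archives"}

# Paths below are copied from the question.
HADOOP_LIB = "/usr/local/Cellar/hadoop/2.6.0/libexec/share/hadoop/tools/lib"
STREAMING_JAR = HADOOP_LIB + "/hadoop-streaming-2.6.0.jar"


def order_streaming_args(args):
    """Return args with generic flag/value pairs moved ahead of streaming ones.

    Hypothetical helper: assumes args is a flat list of flag/value pairs.
    The relative order inside each group is preserved.
    """
    if len(args) % 2 != 0:
        raise ValueError("expected a flat list of flag/value pairs")
    generic, streaming = [], []
    for flag, value in zip(args[0::2], args[1::2]):
        target = generic if flag in GENERIC_FLAGS else streaming
        target.extend([flag, value])
    return generic + streaming


# The options from the first attempt in the question, in their original order.
question_args = [
    "-input", "user/test/input/",
    "-output", "user/test/output/",
    "-inputformat", "com.mongodb.hadoop.mapred.MongoInputFormat",
    "-outputformat", "com.mongodb.hadoop.mapred.MongoOutputFormat",
    "-io", "mongodb",
    "-D", "mongo.input.uri=mongodb://localhost/my_dbs.collectionName",
    "-D", "stream.io.identifier.resolver.class="
          "com.mongodb.hadoop.streaming.io.MongoIdentifierResolver",
    "-mapper", "/Users/wordcountMapper.py",
    "-reducer", "/Users/wordcountReducer.py",
    "-libjars", HADOOP_LIB + "/mongo-hadoop-streaming.jar",
]

command = ["hadoop", "jar", STREAMING_JAR] + order_streaming_args(question_args)

# Print a shell-safe version of the reordered command for copy and paste.
print(" ".join(shlex.quote(part) for part in command))
```

The same ordering applies if you type the command by hand: put every -D and -libjars pair directly after the streaming jar path, then the remaining options.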