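I am trying to use Hadoop streaming with mongo-hadoop and Python. Reading from a MongoDB collection works, but writing does not. As shown below, the job runs successfully, yet the output collection stays empty.
I tried both the prebuilt 1.4.0 jars and the latest git snapshot (1.4.1) of mongo-hadoop. The Hadoop distribution is the Hortonworks Sandbox with HDP 2.2.4.2, and the same problem occurs with HDP 2.3.
The mongo-hadoop wiki is somewhat outdated, so I am not sure whether I am using the wrong parameters, missing something, or have run into a bug.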
$ cat run_python.sh
#!/bin/bash
set -x
export LIBJARS="/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar","/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar","/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar"
su hdfs - -m -c "hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar \
-files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py \
-D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
-D mongo.auth.uri=mongodb://hadoop:password@127.0.0.1:27017/admin \
-D mongo.input.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.in \
-D mongo.output.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.out \
-D mongo.job.verbose=true \
-libjars ${LIBJARS} \
-input /tmp/in \
-output /tmp/out \
-io mongodb \
-inputformat com.mongodb.hadoop.mapred.MongoInputFormat \
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
-mapper mapper.py \
-reducer reducer.py"
Output:
[root@sandbox python]# ./run_python.sh
+ export LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ su hdfs - -m -c 'hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar -files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py -D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver -D mongo.auth.uri=mongodb://hadoop:password@127.0.0.1:27017/admin -D mongo.input.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.in -D mongo.output.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.out -D mongo.job.verbose=true -libjars /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar -input /tmp/in -output /tmp/out -io mongodb -inputformat com.mongodb.hadoop.mapred.MongoInputFormat -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat -mapper mapper.py -reducer reducer.py'
packageJobJar: [] [/usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.2-2.jar] /tmp/streamjob7732112681113565020.jar tmpDir=null
15/09/24 13:38:38 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:38 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:39 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:39 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:41 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:41 INFO driver.cluster: No server chosen by PrimaryServerSelector from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:41 INFO driver.connection: Opened connection [connectionId{localValue:1, serverValue:1358}] to 127.0.0.1:27017
15/09/24 13:38:41 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=28894677}
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:2, serverValue:1359}] to 127.0.0.1:27017
15/09/24 13:38:42 INFO splitter.MongoSplitterFactory: Retrieved Collection stats:{ "ns" : "hadoop.in" , "count" : 100 , "size" : 148928 , "avgObjSize" : 1489 , "numExtents" : 3 , "storageSize" : 172032 , "lastExtentSize" : 131072.0 , "paddingFactor" : 1.0 , "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only." , "userFlags" : 1 , "capped" : false , "nindexes" : 1 , "indexDetails" : { } , "totalIndexSize" : 8176 , "indexSizes" : { "_id_" : 8176} , "ok" : 1.0}
15/09/24 13:38:42 INFO driver.connection: Closed connection [connectionId{localValue:2, serverValue:1359}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:42 INFO mapred.MongoInputFormat: Using com.mongodb.hadoop.splitter.StandaloneMongoSplitter@1a43c7a0 to calculate splits. (old mapreduce API)
15/09/24 13:38:42 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:42 INFO splitter.StandaloneMongoSplitter: Running splitvector to check splits against mongodb://hadoop:password@127.0.0.1:27017/hadoop.in
15/09/24 13:38:42 INFO driver.cluster: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:3, serverValue:1360}] to 127.0.0.1:27017
15/09/24 13:38:42 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=27903847}
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:4, serverValue:1361}] to 127.0.0.1:27017
15/09/24 13:38:42 WARN splitter.StandaloneMongoSplitter: WARNING: No Input Splits were calculated by the split code. Proceeding with a *single* split. Data may be too small, try lowering 'mongo.input.split_size' if this is undesirable.
15/09/24 13:38:42 INFO splitter.MongoCollectionSplitter: Created split: min=null, max= null
15/09/24 13:38:42 INFO driver.connection: Closed connection [connectionId{localValue:4, serverValue:1361}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:43 INFO mapreduce.JobSubmitter: number of splits:1
15/09/24 13:38:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443100485659_0008
15/09/24 13:38:44 INFO impl.YarnClientImpl: Submitted application application_1443100485659_0008
15/09/24 13:38:44 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1443100485659_0008/
15/09/24 13:38:44 INFO mapreduce.Job: Running job: job_1443100485659_0008
15/09/24 13:38:52 INFO mapreduce.Job: Job job_1443100485659_0008 running in uber mode : false
15/09/24 13:38:52 INFO mapreduce.Job: map 0% reduce 0%
15/09/24 13:39:01 INFO mapreduce.Job: map 100% reduce 0%
15/09/24 13:39:09 INFO mapreduce.Job: map 100% reduce 100%
15/09/24 13:39:09 INFO mapreduce.Job: Job job_1443100485659_0008 completed successfully
15/09/24 13:39:10 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=6506
FILE: Number of bytes written=257301
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=376
HDFS: Number of bytes written=3000
HDFS: Number of read operations=3
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5865
Total time spent by all reduces in occupied slots (ms)=5166
Total time spent by all map tasks (ms)=5865
Total time spent by all reduce tasks (ms)=5166
Total vcore-seconds taken by all map tasks=5865
Total vcore-seconds taken by all reduce tasks=5166
Total megabyte-seconds taken by all map tasks=1466250
Total megabyte-seconds taken by all reduce tasks=1291500
Map-Reduce Framework
Map input records=100
Map output records=100
Map output bytes=6300
Map output materialized bytes=6506
Input split bytes=376
Combine input records=0
Combine output records=0
Reduce input groups=100
Reduce shuffle bytes=6506
Reduce input records=100
Reduce output records=100
Spilled Records=200
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=152
CPU time spent (ms)=2150
Physical memory (bytes) snapshot=295743488
Virtual memory (bytes) snapshot=1995943936
Total committed heap usage (bytes)=262909952
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
15/09/24 13:39:10 INFO streaming.StreamJob: Output directory: /tmp/out
Using the same script but storing the output as BSON works, though:
[root@sandbox python]# ./run_python_bson_output.sh
+ export LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ su hdfs - -m -c 'hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar -files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py -D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver -D mongo.auth.uri=mongodb://hadoop:password@127.0.0.1:27017/admin -D mongo.input.uri=mongodb://127.0.0.1:27017/hadoop.in -D mongo.job.verbose=true -D mapreduce.output.fileoutputformat.outputdir=/tmp/output.bson -libjars /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar -input /tmp/in -output /tmp/videos_streaming -io mongodb -inputformat com.mongodb.hadoop.mapred.MongoInputFormat -outputformat com.mongodb.hadoop.mapred.BSONFileOutputFormat -mapper mapper.py -reducer reducer.py'
packageJobJar: [] [/usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.2-2.jar] /tmp/streamjob3257949526000997018.jar tmpDir=null
15/09/24 13:38:00 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:00 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:00 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:00 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:01 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:01 INFO driver.cluster: No server chosen by PrimaryServerSelector from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:1, serverValue:1352}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=24906864}
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:2, serverValue:1353}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO splitter.MongoSplitterFactory: Retrieved Collection stats:{ "ns" : "hadoop.in" , "count" : 100 , "size" : 148928 , "avgObjSize" : 1489 , "numExtents" : 3 , "storageSize" : 172032 , "lastExtentSize" : 131072.0 , "paddingFactor" : 1.0 , "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only." , "userFlags" : 1 , "capped" : false , "nindexes" : 1 , "indexDetails" : { } , "totalIndexSize" : 8176 , "indexSizes" : { "_id_" : 8176} , "ok" : 1.0}
15/09/24 13:38:02 INFO driver.connection: Closed connection [connectionId{localValue:2, serverValue:1353}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:02 INFO mapred.MongoInputFormat: Using com.mongodb.hadoop.splitter.StandaloneMongoSplitter@6e2cc310 to calculate splits. (old mapreduce API)
15/09/24 13:38:02 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:02 INFO splitter.StandaloneMongoSplitter: Running splitvector to check splits against mongodb://127.0.0.1:27017/hadoop.in
15/09/24 13:38:02 INFO driver.cluster: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:3, serverValue:1354}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=32114805}
15/09/24 13:38:03 INFO driver.connection: Opened connection [connectionId{localValue:4, serverValue:1355}] to 127.0.0.1:27017
15/09/24 13:38:03 WARN splitter.StandaloneMongoSplitter: WARNING: No Input Splits were calculated by the split code. Proceeding with a *single* split. Data may be too small, try lowering 'mongo.input.split_size' if this is undesirable.
15/09/24 13:38:03 INFO splitter.MongoCollectionSplitter: Created split: min=null, max= null
15/09/24 13:38:03 INFO driver.connection: Closed connection [connectionId{localValue:4, serverValue:1355}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:03 INFO mapreduce.JobSubmitter: number of splits:1
15/09/24 13:38:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443100485659_0007
15/09/24 13:38:03 INFO impl.YarnClientImpl: Submitted application application_1443100485659_0007
15/09/24 13:38:03 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1443100485659_0007/
15/09/24 13:38:03 INFO mapreduce.Job: Running job: job_1443100485659_0007
15/09/24 13:38:12 INFO mapreduce.Job: Job job_1443100485659_0007 running in uber mode : false
15/09/24 13:38:12 INFO mapreduce.Job: map 0% reduce 0%
15/09/24 13:38:20 INFO mapreduce.Job: map 100% reduce 0%
15/09/24 13:38:28 INFO mapreduce.Job: map 100% reduce 100%
15/09/24 13:38:28 INFO mapreduce.Job: Job job_1443100485659_0007 completed successfully
15/09/24 13:38:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=6506
FILE: Number of bytes written=256757
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=336
HDFS: Number of bytes written=3600
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6144
Total time spent by all reduces in occupied slots (ms)=5032
Total time spent by all map tasks (ms)=6144
Total time spent by all reduce tasks (ms)=5032
Total vcore-seconds taken by all map tasks=6144
Total vcore-seconds taken by all reduce tasks=5032
Total megabyte-seconds taken by all map tasks=1536000
Total megabyte-seconds taken by all reduce tasks=1258000
Map-Reduce Framework
Map input records=100
Map output records=100
Map output bytes=6300
Map output materialized bytes=6506
Input split bytes=336
Combine input records=0
Combine output records=0
Reduce input groups=100
Reduce shuffle bytes=6506
Reduce input records=100
Reduce output records=100
Spilled Records=200
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=177
CPU time spent (ms)=2220
Physical memory (bytes) snapshot=296923136
Virtual memory (bytes) snapshot=1996275712
Total committed heap usage (bytes)=262746112
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=3600
15/09/24 13:38:28 INFO streaming.StreamJob: Output directory: /tmp/videos_streaming
Even restoring the resulting BSON output into MongoDB works.
Answer 0 (score: 3)
This is a bug, and it is fixed in the 1.4.1 release. See https://github.com/mongodb/mongo-hadoop/commit/766922b656d11fd5e661eecb0cc370ba3f86b0d4
In this case, adding
"-D mapred.output.committer.class=com.mongodb.hadoop.mapred.output.MongoOutputCommitter"
to the command leads to the desired result.
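
For reference, here is a rough sketch of how the question's run_python.sh might look with that option added. All paths, URIs and credentials are copied from the question; the only functional change is the extra -D line. The DRY_RUN switch and the CMD variable are assumptions added for this sketch (they are not part of mongo-hadoop or Hadoop), so that the command can be inspected before it is actually submitted.

```shell
#!/bin/bash
# Sketch only: the question's streaming command plus the committer option
# from the answer. DRY_RUN is a hypothetical switch for this example;
# by default the command is printed instead of being run.
DRY_RUN="${DRY_RUN:-1}"

HDP_LIB="/usr/hdp/2.2.4.2-2/hadoop/lib"
LIBJARS="${HDP_LIB}/mongo-hadoop-core-1.4.0.jar,${HDP_LIB}/mongo-hadoop-streaming-1.4.0.jar,${HDP_LIB}/mongo-java-driver-3.0.2.jar"

# Generic options (-files, -D, -libjars) come before the streaming options,
# in the same order as in the original script.
CMD="hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar \
 -files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py \
 -D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
 -D mapred.output.committer.class=com.mongodb.hadoop.mapred.output.MongoOutputCommitter \
 -D mongo.auth.uri=mongodb://hadoop:password@127.0.0.1:27017/admin \
 -D mongo.input.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.in \
 -D mongo.output.uri=mongodb://hadoop:password@127.0.0.1:27017/hadoop.out \
 -D mongo.job.verbose=true \
 -libjars ${LIBJARS} \
 -input /tmp/in \
 -output /tmp/out \
 -io mongodb \
 -inputformat com.mongodb.hadoop.mapred.MongoInputFormat \
 -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
 -mapper mapper.py \
 -reducer reducer.py"

if [ "$DRY_RUN" = "1" ]; then
  # Print the assembled command for review.
  echo "$CMD"
else
  # Submit the job as the hdfs user, as the original script does.
  su hdfs - -m -c "$CMD"
fi
```

Running it with DRY_RUN=0 submits the job. According to the answer, the explicit committer setting should only be needed with the affected 1.4.0 jars.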