我有一个5节点的hadoop集群,我可以在其上成功执行以下流媒体作业
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo1 -mapper 'wc -l' -numReduceTasks 0
但是当我尝试使用python执行流媒体作业时
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -input /sample/apat63_99.txt -output /foo5 -mapper 'AttributeMax.py 8' -file '/tmp/AttributeMax.py' -numReduceTasks 1
我收到错误
packageJobJar: [/tmp/AttributeMax.py, /tmp/hadoop-hdfs/hadoop-unjar2062240123197790813/] [] /tmp/streamjob4074525553604040275.jar tmpDir=null
14/08/29 11:22:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/29 11:22:58 INFO mapred.FileInputFormat: Total input paths to process : 1
14/08/29 11:22:59 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hdfs/mapred/local]
14/08/29 11:22:59 INFO streaming.StreamJob: Running job: job_201408272304_0030
14/08/29 11:22:59 INFO streaming.StreamJob: To kill this job, run:
14/08/29 11:22:59 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030
14/08/29 11:22:59 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030
14/08/29 11:23:00 INFO streaming.StreamJob: map 0% reduce 0%
14/08/29 11:23:46 INFO streaming.StreamJob: map 100% reduce 100%
14/08/29 11:23:46 INFO streaming.StreamJob: To kill this job, run:
14/08/29 11:23:46 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=jt1:8021 -kill job_201408272304_0030
14/08/29 11:23:46 INFO streaming.StreamJob: Tracking URL: http://jt1:50030/jobdetails.jsp?jobid=job_201408272304_0030
14/08/29 11:23:46 ERROR streaming.StreamJob: Job not successful. Error: NA
14/08/29 11:23:46 INFO streaming.StreamJob: killJob...
在我的职业跟踪控制台中,我看到错误
java.io.IOException: log:null
R/W/S=2359/0/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=mapred
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Fri Aug 29 11:22:43 CDT 2014
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:282)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:110)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.Pipe
python代码本身很简单
#!/usr/bin/env python
import sys
index = int(sys.argv[1])
max = 0
for line in sys.stdin
fields = line.strip().split(",")
if fields[index].isdigit():
val = int(fields[index])
if (val > max):
max = val
else:
print max
答案 0 :(得分:0)
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar
-input /sample/cite75_99.txt
-output /foo
-mapper **'python RandomSample.py 10'**
-file RandomSale.py
-numReduceTasks 1