I have a Hadoop streaming job set up as follows:
Bash script:
hadoop fs -rmr /tmp/someone/sentiment/
hadoop jar /opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar \
-input /user/hive/warehouse/tweetjsonsentilife20 \
-output /tmp/someone/sentiment/ \
-mapper "mapper_senti.py" \
-reducer "reducer_senti5.py" \
-file mapper_senti.py \
-file reducer_senti5.py \
-cmdenv dir1=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos \
-cmdenv dir2=http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos/
Mapper:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = (keyval[0], keyval[0])
    if key != "\N" and val != "\N":
        sys.stdout.write('%s\t%s\n' % (key, val))
Reducer:
#!/usr/bin/env python
import sys
import os

for line in sys.stdin:
    keyval = line.strip().split("\t")
    key, val = (keyval[0], keyval[1])
    limit = 1
    dir1 = os.environ['dir1']
    dir2 = os.environ['dir2']
    for file in os.listdir(dir1)[:limit]:
        for word in set(open(dir2 + file).read()):
            value = word
            print "%s\t%s" % (key, value)
And the error:
rmr: DEPRECATED: Please use 'rm -r' instead.
14/06/18 22:10:57 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/tmp/natashac/sentiment' to trash at: hdfs://ip-10-0-0-77.us-west-2.compute.internal:8020/user/ubuntu/.Trash/Current
14/06/18 22:10:59 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper_senti.py, reducer_senti5.py] [/opt/cloudera/parcels/CDH-5.0.0-0.cdh5b2.p0.27/lib/hadoop-mapreduce/hadoop-streaming-2.2.0-cdh5.0.0-beta-2.jar] /tmp/streamjob8672448578858048676.jar tmpDir=null
14/06/18 22:11:00 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032
14/06/18 22:11:01 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-77.us-west-2.compute.internal/10.0.0.77:8032
14/06/18 22:11:01 INFO mapred.FileInputFormat: Total input paths to process : 1
14/06/18 22:11:01 INFO mapreduce.JobSubmitter: number of splits:2
14/06/18 22:11:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402902347983_0058
14/06/18 22:11:02 INFO impl.YarnClientImpl: Submitted application application_1402902347983_0058
14/06/18 22:11:02 INFO mapreduce.Job: The url to track the job: http://ip-10-0-0-77.us-west-2.compute.internal:8088/proxy/application_1402902347983_0058/
14/06/18 22:11:02 INFO mapreduce.Job: Running job: job_1402902347983_0058
14/06/18 22:11:09 INFO mapreduce.Job: Job job_1402902347983_0058 running in uber mode : false
14/06/18 22:11:09 INFO mapreduce.Job: map 0% reduce 0%
14/06/18 22:11:16 INFO mapreduce.Job: map 100% reduce 0%
14/06/18 22:11:22 INFO mapreduce.Job: Task Id : attempt_1402902347983_0058_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:165)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
The syntax all seems correct, but the job doesn't work. The mapper appears to be fine, but the reducer is not. Does anyone have any ideas?
Thanks!
Answer 0 (score: 0)
Where do you think dir1 and dir2 are? The reducer runs on whatever node YARN assigns it, and those values are http:// URLs, not local filesystem paths, so os.listdir(dir1) on that node cannot find those directories and the subprocess exits with code 1.
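A minimal sketch of the point (hypothetical paths; the helper name and the word-splitting behavior are my own, not from the question's code): os.listdir() only understands local filesystem paths, so the reducer would need the aclImdb data present locally on every node, e.g. shipped with the job via the -files/-archives generic options or copied to each node, rather than referenced through an http:// URL.

```python
import os
import tempfile

def words_from_dir(path, limit=1):
    # os.listdir() works only on local paths; passing an http:// URL
    # raises OSError, which is what kills the reducer subprocess.
    words = set()
    for name in os.listdir(path)[:limit]:
        with open(os.path.join(path, name)) as f:
            words.update(f.read().split())
    return words

# Hypothetical local stand-in for aclImdb/train/pos: in the real job this
# directory would be distributed to each node, not fetched over HTTP.
sample_dir = tempfile.mkdtemp()
with open(os.path.join(sample_dir, 'review1.txt'), 'w') as f:
    f.write('good movie good plot')

print(sorted(words_from_dir(sample_dir)))  # ['good', 'movie', 'plot']

try:
    words_from_dir('http://ip-10-0-0-77.us-west-2.compute.internal:8088/home/ubuntu/aclImdb/train/pos')
except OSError as exc:
    print('listing the URL fails: %s' % type(exc).__name__)
```

Running the reducer locally with dir1/dir2 set to a real local directory (cat input | ./mapper_senti.py | sort | ./reducer_senti5.py) is a quick way to confirm this before resubmitting the job.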