I am trying to run my first mapreduce job, which aggregates some data from xml files. The job is failing, and since I am new to Hadoop I would appreciate it if someone could take a look at what is going wrong.
I have:
posts_mapper.py:
#!/usr/bin/env python
import sys
import xml.etree.ElementTree as ET
input_string = sys.stdin.read()
class User(object):
    def __init__(self, id):
        self.id = id
        self.post_type_1_count = 0
        self.post_type_2_count = 0
        self.aggregate_post_score = 0
        self.aggregate_post_size = 0
        self.tags_count = {}

users = {}

root = ET.fromstring(input_string)
for child in root.getchildren():
    user_id = int(child.get("OwnerUserId"))
    post_type = int(child.get("PostTypeId"))
    score = int(child.get("Score"))
    #view_count = int(child.get("ViewCount"))
    post_size = len(child.get("Body"))
    tags = child.get("Tags")

    if user_id not in users:
        users[user_id] = User(user_id)

    user = users[user_id]
    if post_type == 1:
        user.post_type_1_count += 1
    else:
        user.post_type_2_count += 1
    user.aggregate_post_score += score
    user.aggregate_post_size += post_size

    if tags != None:
        tags = tags.replace("<", " ").replace(">", " ").split()
        for tag in tags:
            if tag not in user.tags_count:
                user.tags_count[tag] = 0
            user.tags_count[tag] += 1

for i in users:
    user = users[i]
    out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
    for tag in user.tags_count:
        out += "%s %d " % (tag, user.tags_count[tag])
    print out
posts_reducer.py:
#!/usr/bin/env python
import sys
class User(object):
    def __init__(self, id):
        self.id = id
        self.post_type_1_count = 0
        self.post_type_2_count = 0
        self.aggregate_post_score = 0
        self.aggregate_post_size = 0
        self.tags_count = {}

users = {}

for line in sys.stdin:
    vals = line.split()
    user_id = int(vals[0])
    post_type_1 = int(vals[1])
    post_type_2 = int(vals[2])
    aggregate_post_score = int(vals[3])
    aggregate_post_size = int(vals[4])

    tags = {}
    if len(vals) > 5:
        #this means we got tags
        for i in range(5, len(vals), 2):
            tag = vals[i]
            count = int(vals[i+1])
            tags[tag] = count

    if user_id not in users:
        users[user_id] = User(user_id)

    user = users[user_id]
    user.post_type_1_count += post_type_1
    user.post_type_2_count += post_type_2
    user.aggregate_post_score += aggregate_post_score
    user.aggregate_post_size += aggregate_post_size
    for tag in tags:
        if tag not in user.tags_count:
            user.tags_count[tag] = 0
        user.tags_count[tag] += tags[tag]

for i in users:
    user = users[i]
    out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
    for tag in user.tags_count:
        out += "%s %d " % (tag, user.tags_count[tag])
    print out
I run the command:
bin/hadoop jar hadoop-streaming-2.6.0.jar -input /stackexchange/beer/posts -output /stackexchange/beer/results -mapper posts_mapper.py -reducer posts_reducer.py -file ~/mapreduce/posts_mapper.py -file ~/mapreduce/posts_reducer.py
and get the output:
packageJobJar: [/home/hduser/mapreduce/posts_mapper.py, /home/hduser/mapreduce/posts_reducer.py, /tmp/hadoop-unjar6585010774815976682/] [] /tmp/streamjob8863638738687983603.jar tmpDir=null
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:57 INFO mapred.FileInputFormat: Total input paths to process : 10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: number of splits:10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426769192808_0004
15/03/20 10:18:58 INFO impl.YarnClientImpl: Submitted application application_1426769192808_0004
15/03/20 10:18:58 INFO mapreduce.Job: The url to track the job: http://i-644dd931:8088/proxy/application_1426769192808_0004/
15/03/20 10:18:58 INFO mapreduce.Job: Running job: job_1426769192808_0004
15/03/20 10:19:11 INFO mapreduce.Job: Job job_1426769192808_0004 running in uber mode : false
15/03/20 10:19:11 INFO mapreduce.Job:  map 0% reduce 0%
15/03/20 10:19:41 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_0, Status : FAILED
15/03/20 10:19:48 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_0, Status : FAILED
15/03/20 10:20:00 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_1, Status : FAILED
15/03/20 10:20:08 INFO mapreduce.Job:  map 7% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_1, Status : FAILED
15/03/20 10:20:11 INFO mapreduce.Job:  map 10% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_1, Status : FAILED
15/03/20 10:20:19 INFO mapreduce.Job:  map 10% reduce 0%
15/03/20 10:20:19 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_1, Status : FAILED
15/03/20 10:20:22 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:22 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_2, Status : FAILED
15/03/20 10:20:25 INFO mapreduce.Job:  map 40% reduce 0%
15/03/20 10:20:25 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

15/03/20 10:20:28 INFO mapreduce.Job:  map 50% reduce 0%
15/03/20 10:20:28 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_2, Status : FAILED
15/03/20 10:20:42 INFO mapreduce.Job:  map 50% reduce 17%
15/03/20 10:20:52 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_2, Status : FAILED
15/03/20 10:20:54 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_2, Status : FAILED
15/03/20 10:20:56 INFO mapreduce.Job:  map 90% reduce 0%
15/03/20 10:20:57 INFO mapreduce.Job:  map 100% reduce 100%
15/03/20 10:20:58 INFO mapreduce.Job: Job job_1426769192808_0004 failed with state FAILED due to: Task failed task_1426769192808_0004_m_000006
Job failed as tasks failed. failedMaps:1 failedReduces:0
Answer (score 0):
Unfortunately, Hadoop does not show you the stderr of your Python mapper/reducer here, so this output does not give much of a clue.
I suggest the following two troubleshooting steps (a small defensive sketch of the mapper's input handling follows after them):

1. Run the pipeline locally against the raw input files, so that any Python traceback is printed straight to your terminal:

cat {your_input_files} | ./posts_mapper.py | sort | ./posts_reducer.py

2. Check the logs of the failed map attempts:

yarn logs -applicationId application_1426769192808_0004

or

hdfs dfs -cat /var/log/hadoop-yarn/apps/{user}/logs/
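
If the local run or the attempt logs point at the XML handling (for example posts that lack an OwnerUserId attribute, or a map input that is not a complete XML document), a defensive variant of the mapper's input handling can surface the actual Python error in the task logs instead of just the exit code 1. This is only a minimal sketch under those assumptions; the get_int helper and the skip/logging behaviour are not part of the original code:

#!/usr/bin/env python
# Sketch only: a defensive variant of the input handling in posts_mapper.py.
# The get_int helper and the skip-on-missing-attribute behaviour are
# assumptions for debugging, not part of the original job.
import sys
import xml.etree.ElementTree as ET

def get_int(element, name):
    # Posts may lack optional attributes; int(None) would raise a TypeError
    # and kill the streaming task with exit code 1.
    value = element.get(name)
    return int(value) if value is not None else None

input_string = sys.stdin.read()
try:
    root = ET.fromstring(input_string)
except Exception as e:
    # Anything written to stderr ends up in the YARN task logs,
    # so the real Python error becomes visible there.
    sys.stderr.write("posts_mapper.py: could not parse input as XML: %s\n" % e)
    sys.exit(1)

for child in root:
    user_id = get_int(child, "OwnerUserId")
    if user_id is None:
        sys.stderr.write("posts_mapper.py: skipping row without OwnerUserId\n")
        continue
    # the original per-post aggregation would go here; for this sketch
    # just confirm that the row parsed
    sys.stdout.write("%d\n" % user_id)

Running this sketch through the same local pipe as in step 1 shows whether the failure comes from the XML parsing itself or from individual rows with missing attributes.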