Hadoop Streaming job fails

Date: 2015-03-20 10:28:52

Tags: hadoop

I am trying to run my first mapreduce job, which aggregates some data from xml files. My job is failing, and since I am new to Hadoop I would appreciate it if somebody could take a look at what is going wrong.

I have:

posts_mapper.py:

#!/usr/bin/env python

import sys
import xml.etree.ElementTree as ET

input_string = sys.stdin.read()


class User(object):

    def __init__(self, id):
        self.id = id
        self.post_type_1_count = 0
        self.post_type_2_count = 0
        self.aggregate_post_score = 0
        self.aggregate_post_size = 0
        self.tags_count = {}


users = {}
root = ET.fromstring(input_string)
for child in root.getchildren():
    user_id = int(child.get("OwnerUserId"))
    post_type = int(child.get("PostTypeId"))
    score = int(child.get("Score"))
    #view_count = int(child.get("ViewCount"))
    post_size = len(child.get("Body"))
    tags = child.get("Tags")

    if user_id not in users:
        users[user_id] = User(user_id)
    user = users[user_id]
    if post_type == 1:
        user.post_type_1_count += 1
    else:
        user.post_type_2_count += 1
    user.aggregate_post_score += score
    user.aggregate_post_size += post_size

    if tags is not None:
        tags = tags.replace("<", " ").replace(">", " ").split()
        for tag in tags:
            if tag not in user.tags_count:
                user.tags_count[tag] = 0
            user.tags_count[tag] += 1

for i in users:
    user = users[i]
    out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
    for tag in user.tags_count:
        out += "%s %d " % (tag, user.tags_count[tag])
    print out
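One thing worth checking in a mapper like this: `Element.get()` returns `None` when an attribute is missing, so `int(child.get("OwnerUserId"))` raises a `TypeError` on any row that lacks that attribute, and the streaming task dies with a non-zero exit code. A minimal sketch of the failure mode (the two-row sample XML below is hypothetical, loosely modeled on the Stack Exchange dump format):

```python
import xml.etree.ElementTree as ET

# Two rows: the second one has no OwnerUserId attribute.
sample = """<posts>
  <row OwnerUserId="7" PostTypeId="1" Score="3" Body="abc" />
  <row PostTypeId="2" Score="1" Body="xy" />
</posts>"""

root = ET.fromstring(sample)
rows = list(root)

print(rows[0].get("OwnerUserId"))  # -> "7" (a string)
print(rows[1].get("OwnerUserId"))  # -> None

try:
    int(rows[1].get("OwnerUserId"))
except TypeError as exc:
    # This is where a mapper doing int(child.get(...)) would crash.
    print("mapper would crash here: %s" % exc)
```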

posts_reducer.py:

#!/usr/bin/env python

import sys


class User(object):

    def __init__(self, id):
        self.id = id
        self.post_type_1_count = 0
        self.post_type_2_count = 0
        self.aggregate_post_score = 0
        self.aggregate_post_size = 0
        self.tags_count = {}


users = {}
for line in sys.stdin:

    vals = line.split()
    user_id = int(vals[0])
    post_type_1 = int(vals[1])
    post_type_2 = int(vals[2])
    aggregate_post_score = int(vals[3])
    aggregate_post_size = int(vals[4])
    tags = {}
    if len(vals) > 5:
        #this means we got tags
        for i in range(5, len(vals), 2):
            tag = vals[i]
            count = int(vals[i + 1])
            tags[tag] = count

    if user_id not in users:
        users[user_id] = User(user_id)
    user = users[user_id]
    user.post_type_1_count += post_type_1
    user.post_type_2_count += post_type_2
    user.aggregate_post_score += aggregate_post_score
    user.aggregate_post_size += aggregate_post_size
    for tag in tags:
        if tag not in user.tags_count:
            user.tags_count[tag] = 0
        user.tags_count[tag] += tags[tag]

for i in users:
    user = users[i]
    out = "%d %d %d %d %d " % (user.id, user.post_type_1_count, user.post_type_2_count, user.aggregate_post_score, user.aggregate_post_size)
    for tag in user.tags_count:
        out += "%s %d " % (tag, user.tags_count[tag])
    print out

I run the command:

bin/hadoop jar hadoop-streaming-2.6.0.jar -input /stackexchange/beer/posts -output /stackexchange/beer/results -mapper posts_mapper.py -reducer posts_reducer.py -file ~/mapreduce/posts_mapper.py -file ~/mapreduce/posts_reducer.py

and get the output:

packageJobJar: [/home/hduser/mapreduce/posts_mapper.py, /home/hduser/mapreduce/posts_reducer.py, /tmp/hadoop-unjar6585010774815976682/] [] /tmp/streamjob8863638738687983603.jar tmpDir=null
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:55 INFO client.RMProxy: Connecting to ResourceManager at Master/10.1.1.22:8040
15/03/20 10:18:57 INFO mapred.FileInputFormat: Total input paths to process : 10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: number of splits:10
15/03/20 10:18:57 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1426769192808_0004
15/03/20 10:18:58 INFO impl.YarnClientImpl: Submitted application application_1426769192808_0004
15/03/20 10:18:58 INFO mapreduce.Job: The url to track the job: http://i-644dd931:8088/proxy/application_1426769192808_0004/
15/03/20 10:18:58 INFO mapreduce.Job: Running job: job_1426769192808_0004
15/03/20 10:19:11 INFO mapreduce.Job: Job job_1426769192808_0004 running in uber mode : false
15/03/20 10:19:11 INFO mapreduce.Job:  map 0% reduce 0%
15/03/20 10:19:41 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_0, Status : FAILED
15/03/20 10:19:48 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_0, Status : FAILED
15/03/20 10:19:50 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_0, Status : FAILED
15/03/20 10:20:00 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_1, Status : FAILED
15/03/20 10:20:08 INFO mapreduce.Job:  map 7% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:10 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_1, Status : FAILED
15/03/20 10:20:11 INFO mapreduce.Job:  map 10% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:17 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_1, Status : FAILED
15/03/20 10:20:19 INFO mapreduce.Job:  map 10% reduce 0%
15/03/20 10:20:19 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_1, Status : FAILED
15/03/20 10:20:22 INFO mapreduce.Job:  map 20% reduce 0%
15/03/20 10:20:22 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000006_2, Status : FAILED
15/03/20 10:20:25 INFO mapreduce.Job:  map 40% reduce 0%
15/03/20 10:20:25 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:322)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:535)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

15/03/20 10:20:28 INFO mapreduce.Job:  map 50% reduce 0%
15/03/20 10:20:28 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000007_2, Status : FAILED
15/03/20 10:20:42 INFO mapreduce.Job:  map 50% reduce 17%
15/03/20 10:20:52 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000008_2, Status : FAILED
15/03/20 10:20:54 INFO mapreduce.Job: Task Id : attempt_1426769192808_0004_m_000009_2, Status : FAILED
15/03/20 10:20:56 INFO mapreduce.Job:  map 90% reduce 0%
15/03/20 10:20:57 INFO mapreduce.Job:  map 100% reduce 100%
15/03/20 10:20:58 INFO mapreduce.Job: Job job_1426769192808_0004 failed with state FAILED due to: Task failed task_1426769192808_0004_m_000006
Job failed as tasks failed. failedMaps:1 failedReduces:0

1 Answer:

Answer 0 (score: 0):

Unfortunately, hadoop does not show you the stderr of your python mapper/reducer here, so this output gives no clue about what went wrong.

I would recommend the following 2 troubleshooting steps:

  1. Test your mapper/reducer locally:

     cat {your_input_files} | ./posts_mapper.py | sort | ./posts_reducer.py

  2. If no problem is found in step 1, create the map reduce job and check the output logs:

     yarn logs -applicationId application_1426769192808_0004

     hdfs dfs -cat /var/log/hadoop-yarn/apps/{user}/logs/
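If the local pipeline in step 1 does crash on a particular row, one defensive pattern (my suggestion, not part of the original answer) is to coerce attribute values through a helper that tolerates missing or malformed data, so one bad record does not kill the whole streaming task:

```python
import xml.etree.ElementTree as ET

def safe_int(value, default=0):
    # Element.get() returns None when an attribute is missing;
    # int(None) would raise TypeError and abort the task, so fall
    # back to a default instead.
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

# Hypothetical sample: the second row has no Score attribute.
sample = '<posts><row Score="5"/><row/></posts>'
for row in ET.fromstring(sample):
    print(safe_int(row.get("Score")))  # prints 5, then 0
```

Whether skipping or defaulting bad rows is acceptable depends on the aggregation; counting and logging them to stderr keeps the data loss visible.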