Question

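I'm new to Hadoop and MapReduce and am trying to find my way through it. I'm trying to develop a MapReduce application in Python that uses data from two .csv files. In the mapper I simply read the two files and then print the key-value pairs from them to sys.stdout.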

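The program runs fine when I use it on a single machine, but with Hadoop Streaming I get an error. I think I'm making a mistake in how I read the files in the mapper on Hadoop. Please help me with the code, and tell me how to do file handling in Hadoop Streaming. The mapper.py code is below (the comments explain what it does):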

#!/usr/bin/env python
import sys
from numpy import genfromtxt

def read_input(inVal):
    for line in inVal:
        # strip surrounding whitespace from each input line
        yield line.strip()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    labels=[]
    data=[]    
    incoming = read_input(sys.stdin)
    for vals in incoming:
        # Rows longer than 10 characters are treated as pixel data,
        # shorter rows as labels. The tab-delimited pairs printed
        # further down go to STDOUT (standard output) and become the
        # input for the Reduce step, i.e. the input for reducer.py.
        if len(vals) > 10:
            data.append(vals)
        else:
            labels.append(vals)

    for i in range(0,len(labels)):
        print "%s%s%s\n" % (labels[i], separator, data[i])


if __name__ == "__main__":
    main()

The 60,000 records from the two .csv files are fed into this mapper as follows (on a single machine, not on the Hadoop cluster):

cat mnist_train_labels.csv mnist_train_data.csv | ./mapper.py

Answer 1

After searching for a solution for 3 days, I was able to solve the problem.

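The problem is with the newer version of Hadoop (2.2.0 in my case). While reading values from the files, the mapper code was returning a non-zero exit code at some point (possibly because it was reading a huge list of values (784) at a time). Hadoop 2.2.0 has a setting that tells the Hadoop system to raise a general error (subprocess failed with code 1) in that case. This setting is true by default. I just had to set this property to false, and my code then ran without any errors.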

The setting is stream.non.zero.exit.is.failure. Set it to false when streaming. The streaming command would then look something like:

**hadoop jar ... -D stream.non.zero.exit.is.failure=false ...**

Hope it helps someone and saves them 3 days... ;)
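
Turning the property off lets the job tolerate a failing mapper, but it does not show why the mapper exited with a non-zero code. Below is a minimal sketch of one way to surface the cause, assuming the mapper's stderr ends up in the task logs: wrap the mapper body, print the traceback to stderr, and return the exit status explicitly. The names `pair_rows` and `run_mapper` are hypothetical and not part of the original mapper.py, and the code is written to run on both Python 2 and Python 3.

```python
import sys
import traceback


def pair_rows(lines, min_data_len=10):
    """Split lines into labels and data rows, then pair them by position.

    Follows the rule in the question's mapper: rows longer than
    min_data_len characters are data, the rest are labels.
    Blank lines are skipped (an assumption; the original keeps them as labels).
    """
    labels, data = [], []
    for line in lines:
        row = line.strip()
        if not row:
            continue
        if len(row) > min_data_len:
            data.append(row)
        else:
            labels.append(row)
    # zip stops at the shorter list, so unequal counts cannot raise IndexError
    return list(zip(labels, data))


def run_mapper(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr, separator="\t"):
    """Run the mapper and return an exit status; log any exception to stderr."""
    try:
        for label, row in pair_rows(stdin):
            stdout.write("%s%s%s\n" % (label, separator, row))
        return 0
    except Exception:
        # The traceback is what makes the non-zero exit diagnosable.
        traceback.print_exc(file=stderr)
        return 1
```

In mapper.py this would be called as `sys.exit(run_mapper())` under the usual `if __name__ == "__main__":` guard, so a failure still produces a non-zero exit code but leaves a traceback behind.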

Answer 2

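You didn't post the error message. In streaming you need to pass the file with either the -file argument or -input, so that it is either uploaded along with your streaming job or the job knows where to find it on HDFS.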
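
As a rough sketch of the -file route: a file shipped with -file is normally available in the task's working directory, so the mapper can open it by its bare name while the main data arrives on stdin. The example assumes the labels file is shipped with -file and the data file is passed with -input. The helper names `load_labels` and `map_records` are hypothetical, and so is the leading row-number column in each data row, which is added here only so a row can find its label when the mapper sees just one split of the input.

```python
import sys


def load_labels(path):
    """Read one label per line from a side file in the task's working directory."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def map_records(lines, labels, separator="\t"):
    """Yield 'label<TAB>pixels' for rows of the form 'row_id,pixel,pixel,...'.

    The leading row_id column is an assumption made for this sketch;
    the original CSV files in the question do not necessarily have it.
    """
    for line in lines:
        row = line.strip()
        if not row:
            continue
        row_id, _, pixels = row.partition(",")
        yield "%s%s%s" % (labels[int(row_id)], separator, pixels)


def main(labels_path="mnist_train_labels.csv"):
    # Side file comes from the working directory, data rows from stdin.
    labels = load_labels(labels_path)
    for record in map_records(sys.stdin, labels):
        sys.stdout.write(record + "\n")
```

With this layout the labels file and mapper.py would each be passed with -file, and the data file's HDFS location with -input.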

Using files in Hadoop Streaming with Python
