Question

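I need to split multiple sets of binary files (according to some logic) and distribute them to mappers. I use Hadoop streaming for this. The main problem is sending the exact binary chunks over the wire without altering them. It turns out that sending raw bytes is not trivial.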

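To better illustrate the problem, I wrote a very simple extension of the RecordReader class that should read some bytes from the split and send them. The binary data can contain anything (including newline characters). Here is what next() may read: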

public class MyRecordReader implements
        RecordReader<BytesWritable, BytesWritable> {
    ...
    public boolean next(BytesWritable key, BytesWritable ignore)
            throws IOException {
        ...

        byte[] result = new byte[8];
        for (int i = 0; i < result.length; ++i)
            result[i] = (byte)(i+1);
        result[3] = (byte)'\n';
        result[4] = (byte)'\n';

        key.set(result, 0, result.length);
        return true;
    }
}

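In this scenario, every call to next() should write the following byte sequence to the mapper's stdin: 01 02 03 0a 0a 06 07 08. If I use typed bytes (HADOOP-1722), the sequence should be prefixed with five bytes in total: the first byte for the type of the sequence (0 for bytes) and the other four bytes for the size. So the sequence should look like this: 00 00 00 00 08 01 02 03 0a 0a 06 07 08.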
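The expected framing can be reproduced outside Hadoop. The following is a minimal sketch, assuming only the layout described above (one type byte, a four-byte big-endian length, then the payload); `TypedBytesFrame` and its methods are hypothetical names for illustration and are not part of Hadoop.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Hadoop: builds one typed-bytes "bytes" frame.
final class TypedBytesFrame {
    // Type code 0 marks a raw byte sequence, as described in the question.
    static final byte TYPE_BYTES = 0;

    static byte[] encode(byte[] payload) {
        // ByteBuffer is big-endian by default, which matches the size field.
        return ByteBuffer.allocate(1 + 4 + payload.length)
                .put(TYPE_BYTES)          // 1 byte: type code
                .putInt(payload.length)   // 4 bytes: payload length
                .put(payload)             // payload, copied verbatim
                .array();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, '\n', '\n', 6, 7, 8};
        // prints: 00 00 00 00 08 01 02 03 0a 0a 06 07 08
        System.out.println(toHex(encode(payload)));
    }
}
```

Running `main` on the eight payload bytes from next() yields the thirteen-byte sequence quoted above.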

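I tested the record reader against /bin/cat to verify the results. The command is as follows:

hadoop jar <streaming jar location> -libjars <my input format jar> -D stream.map.input=typedbytes -mapper /bin/cat -inputformat my.input.Format

Using hexdump to inspect the incoming keys, I got this: 00 00 00 00 08 01 02 03 09 0a 09 0a 06 07 08. As you can see, every 0a (newline) is prefixed with 09 (tab), yet the typed-bytes header still gives the (original) correct information about the type and size of the byte sequence.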
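To make the consequence concrete, here is a small sketch of a consumer that trusts the five-byte header and reads exactly the declared number of payload bytes. It assumes the same frame layout as in the question; `TypedBytesFrameReader` is a hypothetical name, not a Hadoop class. Fed the fifteen bytes seen in the hexdump, it returns a payload that already contains the inserted tabs and leaves two bytes unread, so any following frame would be misparsed.

```java
import java.nio.ByteBuffer;

// Hypothetical reader, not part of Hadoop: parses one typed-bytes "bytes" frame.
final class TypedBytesFrameReader {
    static byte[] readFrame(ByteBuffer in) {
        int type = in.get();             // 1 byte: type code, 0 means bytes
        if (type != 0) {
            throw new IllegalStateException("unexpected type code " + type);
        }
        int length = in.getInt();        // 4 bytes: big-endian payload length
        byte[] payload = new byte[length];
        in.get(payload);                 // read exactly the declared length
        return payload;
    }

    public static void main(String[] args) {
        // The byte sequence reported by hexdump in the question.
        byte[] observed = {0, 0, 0, 0, 8, 1, 2, 3, 9, 10, 9, 10, 6, 7, 8};
        ByteBuffer in = ByteBuffer.wrap(observed);
        byte[] payload = readFrame(in);
        System.out.println("payload bytes read: " + payload.length);  // 8
        System.out.println("bytes left over: " + in.remaining());     // 2
    }
}
```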

This poses a serious problem when writing mappers in other languages, since the bytes are altered along the way.

It seems there is no guarantee that the bytes will be sent exactly as they are, unless there is something else I am missing?

Answer 1

I found a solution to this problem thanks to a very helpful hint on the hadoop-user mailing list.

In short, we need to override how Hadoop IO writes data to and reads data from the standard streams. To do that:

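1. Extend InputWriter and OutputReader, and also provide your own InputFormat and OutputFormat, so that you have full control over how bytes are written to and read from the streams.
2. Extend the IdentifierResolver class to tell Hadoop to use your own InputWriter and OutputReader.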
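The core of step 1 is a writer and a reader that agree on a framing with no delimiter bytes. The sketch below shows only that logic with plain java.io streams, under the assumption that a four-byte length prefix is an acceptable framing; `RawRecordCodec` and its methods are hypothetical names, and the Hadoop base classes and their wiring are deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the logic inside a custom writer/reader pair.
final class RawRecordCodec {
    // Writer side: length prefix, then the bytes exactly as given.
    static void writeRecord(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    // Reader side: returns null when the stream ends before another length prefix.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException endOfStream) {
            return null;
        }
        byte[] record = new byte[length];
        in.readFully(record);   // no scanning for tabs or newlines
        return record;
    }

    // Pushes records through an in-memory stream and reads them back.
    static List<byte[]> roundTrip(List<byte[]> records) {
        try {
            ByteArrayOutputStream pipe = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(pipe);
            for (byte[] record : records) writeRecord(out, record);
            out.flush();
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(pipe.toByteArray()));
            List<byte[]> result = new ArrayList<>();
            for (byte[] r = readRecord(in); r != null; r = readRecord(in)) result.add(r);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<byte[]> sent = Arrays.asList(
                new byte[] {1, 2, 3, '\n', '\n', 6, 7, 8},
                new byte[] {'\t', '\n'});
        System.out.println("records received: " + roundTrip(sent).size());
    }
}
```

Because the reader never interprets payload bytes, records containing tabs or newlines come back unchanged. A mapper written in another language would need to implement the same framing on its side.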

Use your IdentifierResolver, InputFormat, and OutputFormat as follows:

hadoop jar <streaming jar location> \
    -D stream.io.identifier.resolver.class=my.own.CustomIdentifierResolver \
    -libjars <my input format jar> \
    -mapper /bin/cat \
    -inputformat my.own.CustomInputFormat \
    -outputformat my.own.CustomOutputFormat \
    <other options ...>

The patch provided in the (unmerged) feature request MAPREDUCE-5018 is a great source on how to do this and can be customized to fit your needs.

Sending exact binary sequence with Hadoop streaming
