Question

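I have a large document corpus as the input to a MapReduce job (old Hadoop API). In the mapper I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is:

The input is the same document corpus, so there is no need to process it twice. I don't think MultipleOutputs is the solution, because I can't find a way to feed my Mapper output to two different Reduce classes.

In a nutshell, what I need is the following: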

                   WordCounting Reducer    --> WordCount output
                 /
Input --> Mapper
                 \
                   MinHash Buckets Reducer --> MinHash output

Is there any way to use the same Mapper (in the same job), or should I split this into two jobs?

Answer 1

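You can do it, but it involves some coding tricks (a Partitioner and a prefix convention). The idea is to have the mapper emit word keys prefixed with "W:" and minhash keys prefixed with "M:", then use a Partitioner to decide which partition (that is, which reducer) each key goes to.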

Pseudocode. MAIN method:

Set number of reducers to 2

MAPPER:

.... parse the word ...
... generate minhash ..
context.write("W:" + word, 1);
context.write("M:" + minhash, 1);

PARTITIONER:

IF Key starts with "W:" { return 0; } // reducer 1
IF Key starts with "M:" { return 1; } // reducer 2
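
The partitioner rule above can be sketched in plain Java. This is only an illustration with no Hadoop dependency: the class name `PrefixRouting` and the method `partitionFor` are made up for this example, and in a real job the same logic would sit inside the `getPartition` method of a custom Partitioner.

```java
// Sketch only: plain Java, no Hadoop classes. PrefixRouting and partitionFor
// are hypothetical names; a real job would put this logic in a Partitioner.
final class PrefixRouting {
    static final String WORD_PREFIX = "W:";     // prefix convention from the answer
    static final String MINHASH_PREFIX = "M:";

    /** Returns 0 for word-count keys and 1 for minhash keys. */
    static int partitionFor(String key) {
        if (key.startsWith(WORD_PREFIX)) {
            return 0; // first reducer: word counts
        }
        if (key.startsWith(MINHASH_PREFIX)) {
            return 1; // second reducer: minhash buckets
        }
        // The pseudocode does not say what to do with other keys; failing fast
        // is an assumption made here.
        throw new IllegalArgumentException("Unexpected key prefix: " + key);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("W:hadoop"));
        System.out.println(partitionFor("M:1a2b3c"));
    }
}
```

Rejecting keys without a known prefix makes a mistake in the mapper show up immediately instead of silently landing in one of the two outputs.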

COMBINER:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;} 
Iterate and context.write all of the values

REDUCER:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;} 
IF Key starts with "M:" { perform min hash logic }
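
The prefix dispatch in the reducer can be simulated in plain Java as below. This is a rough sketch rather than Hadoop code: `PrefixReducerSketch` and `reduce` are hypothetical names, the values are passed as an `int[]` instead of an iterator, and because the answer leaves the minhash logic open, the "M:" branch just reports the number of values as a placeholder.

```java
// Sketch only: simulates a single reduce call without Hadoop classes.
final class PrefixReducerSketch {

    /** Dispatches on the key prefix and returns one tab-separated output line. */
    static String reduce(String key, int[] values) {
        if (key.startsWith("W:")) {
            // Word-count branch: sum all partial counts for this word.
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return key + "\t" + sum;
        }
        if (key.startsWith("M:")) {
            // ASSUMPTION: placeholder for the real minhash logic, which the
            // answer does not specify. Here it only reports the bucket size.
            return key + "\t" + values.length;
        }
        throw new IllegalArgumentException("Unexpected key prefix: " + key);
    }
}
```

A combiner would reuse the "W:" branch to pre-sum counts and pass every other value through unchanged, as in the COMBINER pseudocode.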

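In the output, part-00000 will hold the word counts and part-00001 the minhash calculations (these are the old API file names; the new API names them part-r-00000 and part-r-00001).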

Unfortunately, it is not possible to provide different Reducer classes, but with IFs and prefixes you can simulate it.

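From a performance point of view, having just 2 reducers may not be efficient. You can instead play with the Partitioner so that the first N partitions are allocated to the word count and the remaining ones to the minhash keys.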
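
One way to spread the load over more than two reducers is sketched below, again in plain Java without Hadoop classes. The names `RangePartitioning`, `partitionFor`, `numPartitions` and `wordPartitions` are invented for this example, and hashing the key to pick a partition inside each range is an assumption, not something the answer prescribes.

```java
// Sketch only: the first wordPartitions partitions receive "W:" keys,
// the remaining ones receive "M:" keys.
final class RangePartitioning {

    static int partitionFor(String key, int numPartitions, int wordPartitions) {
        if (wordPartitions <= 0 || wordPartitions >= numPartitions) {
            throw new IllegalArgumentException("need 0 < wordPartitions < numPartitions");
        }
        // Mask the sign bit so the modulo result is never negative.
        int hash = key.hashCode() & Integer.MAX_VALUE;
        if (key.startsWith("W:")) {
            return hash % wordPartitions;
        }
        if (key.startsWith("M:")) {
            int minhashPartitions = numPartitions - wordPartitions;
            return wordPartitions + hash % minhashPartitions;
        }
        throw new IllegalArgumentException("Unexpected key prefix: " + key);
    }
}
```

With this layout the word counts end up in the first `wordPartitions` output files and the minhash results in the rest, so downstream consumers need to know the split point.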

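If you don't like the prefix idea, you would need to implement a secondary sort with a custom WritableComparable class for the key. But that is only worth the effort in more sophisticated cases.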
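
As a very rough illustration of the alternative, a composite key could carry an explicit type field instead of a string prefix. The sketch below is plain Java and only shows the ordering part: `TaggedKey` and its fields are hypothetical, and a real Hadoop key would additionally implement WritableComparable with its serialization methods, which are omitted here.

```java
// Sketch only: a composite key ordered first by record type, then by value.
// A real Hadoop key would implement WritableComparable instead of Comparable.
final class TaggedKey implements Comparable<TaggedKey> {
    static final byte WORD = 0;     // hypothetical type tags
    static final byte MINHASH = 1;

    final byte type;
    final String value;

    TaggedKey(byte type, String value) {
        this.type = type;
        this.value = value;
    }

    @Override
    public int compareTo(TaggedKey other) {
        // All WORD keys sort before all MINHASH keys; ties break on the value.
        int byType = Byte.compare(type, other.type);
        return byType != 0 ? byType : value.compareTo(other.value);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TaggedKey)) {
            return false;
        }
        TaggedKey k = (TaggedKey) o;
        return type == k.type && value.equals(k.value);
    }

    @Override
    public int hashCode() {
        return 31 * type + value.hashCode();
    }
}
```

The partitioner and reducer would then switch on `type` rather than calling `startsWith` on a string.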

Answer 2

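AFAIK this is not possible in a single MapReduce job: only the default output (the part-r-0000 files) is fed to the reducer. So if you create two named outputs, say WordCount-m-0 and MinHash-m-0, you can then create two more Map/Reduce jobs, each with an Identity Mapper and the respective Reducer, specifying hdfspath/WordCount-* and hdfspath/MinHash-* as the input of the respective jobs.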