Question

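I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone provide a clear idea or a Java code snippet showing how to do this? My mapper works fine, and after the shuffle I get the keys and their corresponding values as expected. Thanks!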

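What I am trying to do is write only a few records from the input file to a new file. The new output file should contain only those required records and ignore the rest, which are irrelevant. This would work fine even if I don't use MultipleTextOutputFormat. The logic I implemented in the mapper is as follows: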

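    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class MapClass extends
            Mapper<LongWritable, Text, Text, Text> {

        // Reused across calls so no new Text is allocated per record.
        private final Text kword = new Text();
        private final Text vword = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] parts = line.split(" ");

            // Skip lines without a fifth field instead of failing the task.
            if (parts.length < 5) {
                return;
            }

            // The fifth space-separated field is the key; the whole line is the value.
            kword.set(parts[4]);
            vword.set(line);
            context.write(kword, vword);
        }
    }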

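The input to the reducer looks like this:
[key1] -> [value1, value2, ...]
[key2] -> [value1, value2, ...]
[key3] -> [value1, value2, ...] and so on
I am interested only in [key2] -> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me with the reducer.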

Answer 1

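Using MultipleOutputs lets you emit records to multiple files, but only to a predefined number and type of files. It does not allow an arbitrary number of files, nor deciding the file name on the fly from the key/value.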

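You can create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class should decide the output file name, as well as the folder, based on the key/value emitted by the reducer. This can be achieved as follows: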

    package oddjob.hadoop;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

        /**
         * Use the key as part of the path for the final output file.
         */
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
            return new Path(key.toString(), leaf).toString();
        }

        /**
         * When actually writing the data, discard the key since it is already in
         * the file path.
         */
        @Override
        protected Text generateActualKey(Text key, Text value) {
            return null;
        }
    }
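
To make the effect of the class above concrete, here is a minimal, Hadoop-free sketch of the routing it performs. The class name `KeyRoutingSketch`, its methods, and the sample records are hypothetical. The string concatenation stands in for `new Path(key, leaf).toString()`, and `part-00000` is only an assumed example of the leaf name the framework passes in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free model of per-key output routing; all names here are hypothetical.
final class KeyRoutingSketch {

    // Mirrors generateFileNameForKeyValue: the key becomes a directory and
    // the leaf name becomes the file inside it.
    static String fileNameFor(String key, String leaf) {
        return key + "/" + leaf;
    }

    // Groups each {key, value} record under the relative path it would be written to.
    // Only the value is stored, mirroring generateActualKey returning null.
    static Map<String, List<String>> route(List<String[]> records, String leaf) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] record : records) {
            String path = fileNameFor(record[0], leaf);
            files.computeIfAbsent(path, p -> new ArrayList<>()).add(record[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
                new String[] {"key1", "value1"},
                new String[] {"key2", "value1"},
                new String[] {"key2", "value2"});
        // Print each simulated output file with the lines it would contain.
        route(records, "part-00000")
                .forEach((path, lines) -> System.out.println(path + " <- " + lines));
    }
}
```

Under these assumptions, every distinct key ends up with its own subdirectory below the job's output directory, and the lines inside contain the values only.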

For more details, read here.

PS: You will need to use the old mapred API to achieve this, because the newer API does not yet support MultipleTextOutputFormat. Refer to this.
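
For the filtering part of the question (keeping only [key2] and ignoring the other keys), the reduce-side logic could look roughly like the following. This is a plain-Java model, not a real Hadoop Reducer: `KeyFilterSketch`, its `reduce` method, and the `wantedKey` parameter are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hadoop-free model of a filtering reducer; class and method names are hypothetical.
final class KeyFilterSketch {

    // Stand-in for one reduce call: returns the lines to emit for a single key group.
    static List<String> reduce(String key, List<String> values, String wantedKey) {
        List<String> emitted = new ArrayList<>();
        if (!wantedKey.equals(key)) {
            return emitted; // every other key produces no output at all
        }
        emitted.addAll(values); // values of the wanted key pass through unchanged
        return emitted;
    }

    public static void main(String[] args) {
        // A group for key1 is dropped; a group for key2 is kept.
        System.out.println(reduce("key1", Arrays.asList("value1", "value2"), "key2"));
        System.out.println(reduce("key2", Arrays.asList("value1", "value2"), "key2"));
    }
}
```

In an actual job the same comparison would sit inside the reducer's reduce method, or already in the mapper so that unwanted records are never shuffled.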
