Question

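I have some data in CSV format.

For example: K1,K2,data1,data2,data3

My mapper passes the key to the reducer as K1K2 and the value as data1,data2,data3.

I want to save this data in multiple files, each named K1K2 (that is, the key the reducer receives). If I use the MultipleOutputs class, I have to specify the file names before the mapper starts. But here I only know the keys after the mapper has read the data. How should I go about this?

PS: I'm a newbie.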

Answer 1

You can generate the file names and pass them to MultipleOutputs in the reducer, like this:

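private MultipleOutputs<Text, Text> out;

@Override
public void setup(Context context) {
   out = new MultipleOutputs<Text, Text>(context);
   ...
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
  for (Text t : values) {
    // generateFileName is your own function; it returns the base output path for this record
    out.write(key, t, generateFileName(<parameter list...>));
  }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
  // Close MultipleOutputs so that all open writers are flushed
  out.close();
}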

For details, see the MultipleOutputs class reference: https://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html
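To make the generateFileName idea concrete, here is a minimal sketch in plain Java. The helper is hypothetical, not part of Hadoop: the class name FileNames, the rule of replacing anything outside letters, digits, '.', '_' and '-' with '_', and the "key_" fallback prefix are all assumptions chosen for illustration. The only Hadoop-specific point it relies on is that "part" is the default output name and should be avoided as a base name.

```java
// Hypothetical helper, plain Java only; adjust the rules to your own keys.
final class FileNames {
    private FileNames() {}

    /** Builds a base output name from a reducer key such as "K1K2". */
    static String generateFileName(String key) {
        // Assumption: keep a conservative character set so the key is safe as a file name.
        String cleaned = key.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        // Avoid an empty name and the default output name "part".
        if (cleaned.isEmpty() || cleaned.equals("part")) {
            return "key_" + cleaned;
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(generateFileName("K1K2"));
        System.out.println(generateFileName("a/b c"));
    }
}
```

In the reducer this would be called as out.write(key, t, FileNames.generateFileName(key.toString())). Note that the sketch deliberately replaces '/', so it never creates subdirectories; drop that rule if you want one directory per key.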

Answer 2

There is no need to predefine the output file names. You can use MultipleOutputs like this:

// Value stands for your own value type
public class YourReducer extends Reducer<Text, Value, Text, Value> {
    private Value result = null;
    private MultipleOutputs<Text, Value> out;

    @Override
    public void setup(Context context) {
        out = new MultipleOutputs<Text, Value>(context);
    }

    @Override
    public void reduce(Text key, Iterable<Value> values, Context context)
            throws IOException, InterruptedException {
        // do your code (compute result from values)
        // The third argument is the base output path, relative to the job output directory
        out.write(key, result, "outputpath/" + key.toString());
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        out.close();
    }
}

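This writes the output under the following base paths (Hadoop appends a suffix such as -r-00000 to each file name):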

outputpath/K1
          /K2
          /K3
 .......
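As a rough illustration of what ends up on disk, the sketch below mirrors the reducer-side naming pattern, base path plus "-r-" plus a five-digit partition number. It is plain Java that only imitates the pattern; the class and method names are made up for this example, and it assumes uncompressed output with no file extension.

```java
import java.util.Locale;

// Illustration only: imitates the "<base>-r-<partition>" naming pattern, it does not call Hadoop.
final class OutputNames {
    private OutputNames() {}

    /** Returns the file name a reducer with the given partition number would produce. */
    static String reducerFileName(String baseOutputPath, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", baseOutputPath, partition);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"K1", "K2", "K3"}) {
            // Same base path the reducer passes to out.write(...)
            System.out.println(reducerFileName("outputpath/" + key, 0));
        }
    }
}
```

With a single reducer, each key therefore gets one file under outputpath/ inside the job output directory.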

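To do this, set the output format with LazyOutputFormat.setOutputFormatClass() instead of configuring FileOutputFormat directly. The job configuration also needs job.setOutputFormatClass(NullOutputFormat.class). Both calls set the job's output format class, so whichever runs last takes effect. Don't forget to supply the input and output paths as before, using FileInputFormat.setInputPaths() and FileOutputFormat.setOutputPath(). The generated files will then be placed relative to the specified output path.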
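A minimal driver-side sketch of these settings might look like the following. It needs the Hadoop MapReduce client libraries on the classpath. The class name MultiFileJobConfig and the job name are hypothetical, Text keys and values and TextOutputFormat as the wrapped format are assumptions, and mapper and reducer registration is left out. It uses only LazyOutputFormat, on the assumption that the goal is simply to avoid empty default part files.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical helper: output-related job settings only.
final class MultiFileJobConfig {
    private MultiFileJobConfig() {}

    static Job configure(Configuration conf, Path input, Path output) throws IOException {
        Job job = Job.getInstance(conf, "multi-file-output");

        // Assumption: the reducer emits Text keys and Text values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output paths are still required, as in any file-based job.
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        // Wrap the real format so the default part files are created only if something is written to them.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        return job;
    }
}
```

The files written through MultipleOutputs then appear below the directory passed as output, at the base paths the reducer supplies.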

How do I generate multiple file names at runtime in Hadoop?
