How to use the MultipleOutputs class in Hadoop to output files with a specific extension (e.g. .csv)

Date: 2016-04-21 19:52:59

Tags: java file hadoop mapreduce

I currently have a MapReduce program that uses MultipleOutputs to write its results to multiple files. The reducer looks like this:

private MultipleOutputs<NullWritable, Text> mo; // initialized with the context in setup()
// ... helper fields (records, parser, out, millis) elided ...
public void reduce(Edge keys, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        String date = records.formatDate(millis);
        out.set(keys.get(0) + "\t" + keys.get(1));
        parser.parse(keys);
        String filePath = String.format("%s/part", parser.getFileID());
        mo.write(NullWritable.get(), out, filePath);
    }

This is very similar to an example in the book Hadoop: The Definitive Guide, but the problem is that the files come out as plain text. I want my files to be written out as .csv files, and I have not been able to find an explanation of how to do that in the book or online.

How can this be done?
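For context: MultipleOutputs derives the actual file name by appending a task suffix to the base path passed to write(), which is why the output files end up with no extension. A minimal sketch of that naming scheme (plain Java, no Hadoop dependency; the class and method names are hypothetical):

```java
public class MultipleOutputsNaming {
    // Hadoop appends "-<taskType>-<5-digit partition>" to the base path
    // given to MultipleOutputs.write(), e.g. "part" -> "part-r-00000".
    static String partFileName(String base, char taskType, int partition) {
        return String.format("%s-%c-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        // The reducer above passes "<fileID>/part" as the base path:
        System.out.println(partFileName("2016/part", 'r', 0)); // 2016/part-r-00000
    }
}
```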

1 answer:

Answer 0 (score: 2)

Have you tried iterating through your output folder in your driver, after the Job object completes, to rename the files?

As long as your reducer emits text where each line is a CSV row (values separated by semicolons or whatever delimiter you need), you can try something like this:

Job job = new Job(getConf());
//...
//your job setup, including the output config
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
//...
boolean success = job.waitForCompletion(true);
if (success) {
    FileSystem hdfs = FileSystem.get(getConf());
    FileStatus[] files = hdfs.listStatus(new Path(outputPath));
    if (files != null) {
        for (FileStatus aFile : files) {
            if (!aFile.isDir()) {
                // append .csv to every regular file in the output directory
                hdfs.rename(aFile.getPath(),
                        new Path(aFile.getPath().toString() + ".csv"));
            }
        }
    }
}
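One caveat: the output directory usually also contains job bookkeeping files such as _SUCCESS, which the loop above would rename as well. The name-mapping rule can be isolated so those files are skipped; a minimal sketch as a plain helper (no Hadoop dependency; the class and method names are hypothetical):

```java
public class CsvRenamer {
    // Decide the new name for an output file: append ".csv" to regular
    // part files, leave marker/hidden files (_SUCCESS, .crc files) alone.
    static String toCsvName(String name) {
        if (name.startsWith("_") || name.startsWith(".") || name.endsWith(".csv")) {
            return name;
        }
        return name + ".csv";
    }

    public static void main(String[] args) {
        System.out.println(toCsvName("part-r-00000")); // part-r-00000.csv
        System.out.println(toCsvName("_SUCCESS"));     // _SUCCESS (unchanged)
    }
}
```

Inside the rename loop, the driver would then only call hdfs.rename() when toCsvName() returns a different name.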