Question

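My MapReduce job currently uses MultipleOutputs (as explained here) to generate output in the following structure:

The base path of the output is /dev/project/job1/output. However, another job (job2) generates similar data, and I would like the output of this job (job1) to be merged with the output of that job (job2).

I am trying to merge the generated output into a common output directory (/dev/project/combinedoutput) that has the structure above and holds the output of both jobs combined. Is there a way to do this within the job itself, rather than by running shell commands manually?

Any insights are appreciated.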

Answer 1

Within the job itself? Not really, but you can do it in your main function once the job has finished:

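// at the top of the file
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// prior code above

job.waitForCompletion(true);

FileSystem fs = FileSystem.get(conf);

String job1Dir = "/dev/project/job1/output";
String combinedDir = "/dev/project/combinedoutput";

// This glob matches direct children only; use "/*/*" if the files sit one level deeper.
Path job1Path = new Path(job1Dir + "/*");

FileStatus[] job1Files = fs.globStatus(job1Path);

for (FileStatus file : job1Files) {
    if (file.isFile()) {
        // Drop the scheme and authority so the prefix length matches job1Dir.
        String fullFileName = file.getPath().toUri().getPath();
        String belowMainDir = fullFileName.substring(job1Dir.length());
        // Suffix the name with "job1" so it cannot collide with a job2 file.
        Path newFile = new Path(combinedDir + belowMainDir + "job1");
        fs.mkdirs(newFile.getParent());
        fs.rename(file.getPath(), newFile);
    }
}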

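This should move your files over. Do the same for job2 and you should be set. Optionally, you can change the code to copy instead of rename, and/or delete the original job1/job2 directories once you are done.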
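To make the path arithmetic in the loop easier to check, here is a minimal, Hadoop-free sketch of just the name mapping. The class name MergePaths, the helper destinationFor, the typeA subdirectory, and the /dev/project/job2/output path are all hypothetical names picked for illustration; only the job1 and combined paths come from the question.

```java
public class MergePaths {

    /**
     * Maps a file under srcDir to the matching location under combinedDir,
     * appending jobTag to the file name so that files produced by different
     * jobs do not overwrite each other.
     */
    public static String destinationFor(String srcDir, String combinedDir,
                                        String filePath, String jobTag) {
        if (!filePath.startsWith(srcDir + "/")) {
            throw new IllegalArgumentException(filePath + " is not under " + srcDir);
        }
        // Keep everything below srcDir, including the leading "/".
        String belowSrcDir = filePath.substring(srcDir.length());
        return combinedDir + belowSrcDir + jobTag;
    }

    public static void main(String[] args) {
        String combinedDir = "/dev/project/combinedoutput";

        // "typeA" stands in for whatever subdirectory MultipleOutputs wrote to.
        System.out.println(destinationFor("/dev/project/job1/output", combinedDir,
                "/dev/project/job1/output/typeA/part-r-00000", "job1"));

        // The job2 base path is an assumption; the question does not state it.
        System.out.println(destinationFor("/dev/project/job2/output", combinedDir,
                "/dev/project/job2/output/typeA/part-r-00000", "job2"));
    }
}
```

Under these assumptions, both files land in /dev/project/combinedoutput/typeA with different names, which is what lets the same loop run once per job without one job's rename clobbering the other's output.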
