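I have been looking for a way to use reduced data for further mapping in Hadoop. I have objects of class A as input data and objects of class B as output data. The problem is that the mapping generates not only Bs but also new As.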
Here is what I want to achieve:
1.1 input: a list of As
1.2 map result: for each A a list of new As and a list of Bs is generated
1.3 reduce: filtered Bs are saved as output, filtered As are added to the map jobs
2.1 input: a list of As produced by the first map/reduce
2.2 map result: for each A a list of new As and a list of Bs is generated
2.3 ...
3.1 ...
You should get the basic idea.
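The rounds above can be sketched in plain, in-memory Java to make the data flow concrete. This is not Hadoop code: `IterativeRounds`, `MapResult`, `keepA`, `keepB`, and `maxRounds` are hypothetical names introduced only for illustration, and the two predicates stand in for whatever filtering the reduce step performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class IterativeRounds {
    /** Result of mapping one A: new As to process again, plus the Bs it produced. */
    static final class MapResult<A, B> {
        final List<A> newAs;
        final List<B> bs;

        MapResult(List<A> newAs, List<B> bs) {
            this.newAs = newAs;
            this.bs = bs;
        }
    }

    /**
     * Repeats map + filter rounds until no As are left or maxRounds is reached.
     * Filtered Bs accumulate as the final output; filtered As feed the next round.
     */
    static <A, B> List<B> run(List<A> initial,
                              Function<A, MapResult<A, B>> mapper,
                              Predicate<A> keepA,
                              Predicate<B> keepB,
                              int maxRounds) {
        List<B> output = new ArrayList<>();
        List<A> current = new ArrayList<>(initial);
        for (int round = 0; round < maxRounds && !current.isEmpty(); round++) {
            List<A> next = new ArrayList<>();
            for (A a : current) {
                MapResult<A, B> result = mapper.apply(a);
                for (B b : result.bs) {
                    if (keepB.test(b)) {
                        output.add(b);      // saved as output
                    }
                }
                for (A newA : result.newAs) {
                    if (keepA.test(newA)) {
                        next.add(newA);     // becomes input of the next round
                    }
                }
            }
            current = next;
        }
        return output;
    }
}
```

The `maxRounds` cap is a design choice of this sketch: it guards against a mapper that never stops producing new As.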
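I have read a lot about chaining, but I am not sure how to combine ChainReducer and ChainMapper, or even whether that is the right approach.
So here is my question: how can I split the mapped data while reducing, saving one part as output and the other part as new input data?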
Answer 0 (score: 2)
Try using MultipleOutputs. As the Javadoc suggests:
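The MultipleOutputs class simplifies writing output data to multiple outputs.
Case one: writing to additional outputs other than the job default output. Each additional output, or named output, may be configured with its own OutputFormat, with its own key class and with its own value class.
Case two: writing data to different files provided by the user.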
Usage pattern for job submission:
Job job = new Job();
// FileInputFormat has no setInputPath(job, path); setInputPaths is the matching call
FileInputFormat.setInputPaths(job, inDir);
FileOutputFormat.setOutputPath(job, outDir);
job.setMapperClass(MOMap.class);
job.setReducerClass(MOReduce.class);
...
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
        LongWritable.class, Text.class);
// Defines additional sequence-file based output 'seq' for the job
MultipleOutputs.addNamedOutput(job, "seq",
        SequenceFileOutputFormat.class,
        LongWritable.class, Text.class);
...
job.waitForCompletion(true);
...
Usage in the Reducer:
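import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MOReduce extends
        Reducer<WritableComparable, Writable, WritableComparable, Writable> {
    private MultipleOutputs<WritableComparable, Writable> mos;

    // Builds a base output path from the key and value
    private String generateFileName(WritableComparable k, Writable v) {
        return k.toString() + "_" + v.toString();
    }

    @Override
    public void setup(Context context) {
        ...
        mos = new MultipleOutputs<WritableComparable, Writable>(context);
    }

    @Override
    public void reduce(WritableComparable key, Iterable<Writable> values,
            Context context)
            throws IOException, InterruptedException {
        ...
        // Named output 'text', default file name
        mos.write("text", key, new Text("Hello"));
        // Named output 'seq', with explicit base output paths
        mos.write("seq", new LongWritable(1), new Text("Bye"), "seq_a");
        mos.write("seq", new LongWritable(2), new Text("Chau"), "seq_b");
        // Job default output format, with a generated base output path
        mos.write(key, new Text("value"), generateFileName(key, new Text("value")));
        ...
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
        ...
    }
}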
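To feed the As back in as new input, one possible approach is a driver loop that submits one job per round and points each round at the As written by the previous one. The sketch below shows only that wiring and deliberately leaves Hadoop out: `IterativeDriver`, `RoundRunner`, `runRound`, the `round-N` directory naming, and the `as` subdirectory are all hypothetical. It assumes the reducer writes Bs to the job's default output and As under a base output path such as "as/part" (a '/' in the base output path is translated into a directory level), and that `runRound` reports how many As were written, for example from a job counter.

```java
import java.util.ArrayList;
import java.util.List;

final class IterativeDriver {
    /** Hypothetical stand-in for configuring and submitting one Hadoop job. */
    interface RoundRunner {
        /**
         * Runs one map/reduce round from inputDir to outputDir and returns the
         * number of new A records written for the next round. A real
         * implementation would need to handle the job's checked exceptions.
         */
        long runRound(String inputDir, String outputDir);
    }

    /**
     * Runs rounds until a round produces no new As or maxRounds is reached.
     * Returns the output directory of every round that was executed.
     */
    static List<String> runAll(String firstInput, String workDir,
                               int maxRounds, RoundRunner runner) {
        List<String> roundDirs = new ArrayList<>();
        String input = firstInput;
        for (int round = 1; round <= maxRounds; round++) {
            String output = workDir + "/round-" + round;
            long newAs = runner.runRound(input, output);
            roundDirs.add(output);
            if (newAs == 0) {
                break;                  // nothing left to map
            }
            input = output + "/as";     // assumed location of the As side output
        }
        return roundDirs;
    }
}
```

Keeping each round in its own directory means the Bs of all rounds stay available afterwards, and a failed round can be rerun without touching earlier results.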