I have some files on HDFS in a custom thrift-compact-based serialization format as input. By extending Hadoop's FileInputFormat, we can quickly load those files into an RDD.
Now, after applying some groupBy transformations, the output RDD becomes a JavaPairRDD<Long, Iterable<thriftGeneratedClass>>.
I would like to save the groupBy result back to HDFS as multiple output files, one per key of the PairRDD. E.g.: if the PairRDD contains the two keys 100 and 200, two files 100.thrift and 200.thrift should be generated, each containing the whole iterable list of thrift objects for its key.
The code looks like this:
//Feature is some thrift generated class
JavaRDD<Feature> featureJavaRDD = jsc.newAPIHadoopFile(inputPath, ThriftInputFormat.class,
NullWritable.class, Feature.class, jsc.hadoopConfiguration()).values();
JavaPairRDD<Long, Iterable<Feature>> groupByRDD = featureJavaRDD.groupBy(...)
//how to save groupByRDD results back to HDFS with files by key
My question is: what is the best way to achieve this? I suspect the answer involves Hadoop's saveAsNewAPIHadoopFile and MultipleOutputs.
Answer 0 (score: 0)
I had a similar use case a few days ago, and I solved it by writing two custom classes implementing MultipleTextOutputFormat and RecordWriter.
My input was a JavaPairRDD<String, List<String>>, and I wanted it stored in files named after their keys, each containing all the lines of the value. (So it is almost the same use case.)
Here is the code of my MultipleTextOutputFormat implementation:
class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return key.toString(); // the return value is used as the file name
    }

    /** The following 4 methods exist only for visibility purposes
        (they are called from the class MyRecordWriter). **/
    protected String generateLeafFileName(String name) {
        return super.generateLeafFileName(name);
    }

    protected V generateActualValue(K key, V value) {
        return super.generateActualValue(key, value);
    }

    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        return super.getInputFileBasedOutputFileName(job, name);
    }

    protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
        return super.getBaseRecordWriter(fs, job, name, arg3);
    }

    /** Use my custom RecordWriter. **/
    @Override
    public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job, String name, final Progressable arg3) throws IOException {
        final String myName = this.generateLeafFileName(name);
        return new MyRecordWriter<K, V>(this, fs, job, arg3, myName);
    }
}
Here is the code of my RecordWriter implementation:
class MyRecordWriter<K, V> implements RecordWriter<K, V> {

    private RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat;
    private final FileSystem fs;
    private final JobConf job;
    private final Progressable arg3;
    private String myName;

    // One writer per output path, created lazily and closed together at the end.
    TreeMap<String, RecordWriter<K, V>> recordWriters = new TreeMap<>();

    MyRecordWriter(RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat, FileSystem fs, JobConf job, Progressable arg3, String myName) {
        this.rddMultipleTextOutputFormat = rddMultipleTextOutputFormat;
        this.fs = fs;
        this.job = job;
        this.arg3 = arg3;
        this.myName = myName;
    }

    @Override
    public void write(K key, V value) throws IOException {
        String keyBasedPath = rddMultipleTextOutputFormat.generateFileNameForKeyValue(key, value, myName);
        String finalPath = rddMultipleTextOutputFormat.getInputFileBasedOutputFileName(job, keyBasedPath);
        Object actualValue = rddMultipleTextOutputFormat.generateActualValue(key, value);
        RecordWriter rw = this.recordWriters.get(finalPath);
        if (rw == null) {
            rw = rddMultipleTextOutputFormat.getBaseRecordWriter(fs, job, finalPath, arg3);
            this.recordWriters.put(finalPath, rw);
        }
        List<String> lines = (List<String>) actualValue;
        for (String line : lines) {
            rw.write(null, line);
        }
    }

    @Override
    public void close(Reporter reporter) throws IOException {
        for (RecordWriter<K, V> rw : this.recordWriters.values()) {
            rw.close(reporter);
        }
        this.recordWriters.clear();
    }
}
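The writer cache above (a map from output path to a lazily created writer, with every writer closed together at the end) can be illustrated without any Hadoop dependency using plain java.io. This is only a sketch of the pattern; the class name PerKeyFileWriter and the flat directory layout are my own illustrative assumptions:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Minimal stand-in for MyRecordWriter's caching scheme: each key lazily
// opens one file in `dir`, and close() flushes and closes all of them.
class PerKeyFileWriter {
    private final Path dir;
    private final Map<String, Writer> writers = new TreeMap<>();

    PerKeyFileWriter(Path dir) {
        this.dir = dir;
    }

    void write(String key, String line) throws IOException {
        Writer w = writers.get(key);
        if (w == null) { // first record for this key: open its file
            w = Files.newBufferedWriter(dir.resolve(key));
            writers.put(key, w);
        }
        w.write(line);
        w.write(System.lineSeparator());
    }

    void close() throws IOException {
        for (Writer w : writers.values()) {
            w.close();
        }
        writers.clear();
    }
}
```

The TreeMap keeps at most one open handle per key, which is exactly why the Hadoop version can write many output files from a single RecordWriter.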
Most of this code is exactly the same as in FileOutputFormat. The only difference is these few lines:
List<String> lines = (List<String>) actualValue;
for (String line : lines) {
    rw.write(null, line);
}
These lines write each line of the input List<String> to the file. The first argument of write is set to null so that the key is not written on every line.
To finish, I only need this call to write my files:
javaPairRDD.saveAsHadoopFile(path, String.class, List.class, RDDMultipleTextOutputFormat.class);
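For comparison, the end-to-end idea (group records by key, then write one file per key named &lt;key&gt;.thrift) can be sketched with only the Java standard library. This is not the Spark/Hadoop code path; plain strings stand in for serialized thrift records, and the class and file names are illustrative assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Group (key, record) pairs by key, then write each group to "<key>.thrift",
// mirroring what groupBy + the custom output format achieve on HDFS.
class GroupAndSave {
    static Map<Long, List<String>> groupByKey(List<Map.Entry<Long, String>> records) {
        return records.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    static void save(Map<Long, List<String>> grouped, Path dir) throws Exception {
        for (Map.Entry<Long, List<String>> e : grouped.entrySet()) {
            // One file per key, containing every record grouped under it.
            Files.write(dir.resolve(e.getKey() + ".thrift"), e.getValue());
        }
    }
}
```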