Saving groupBy RDD results back to HDFS

Date: 2017-08-02 07:55:39

Tags: java hadoop hdfs rdd thrift

As input, I have some files on HDFS in a custom Thrift-compact-based serialization format. By extending Hadoop's FileInputFormat, we can quickly load these files into an RDD.

Now, after applying a groupBy transformation, the output RDD becomes JavaPairRDD&lt;Long, Iterable&lt;thriftGeneratedClass&gt;&gt;. I want to save the groupBy result back to HDFS as multiple output files, one per key in the PairRDD. For example, if the PairRDD contains two keys, 100 and 200, two files 100.thrift and 200.thrift should be generated, each containing the entire iterable list of Thrift objects for that key. The code looks like this:

// Feature is some Thrift-generated class
JavaRDD<Feature> featureJavaRDD = jsc.newAPIHadoopFile(inputPath, ThriftInputFormat.class,
        NullWritable.class, Feature.class, jsc.hadoopConfiguration()).values();
JavaPairRDD<Long, Iterable<Feature>> groupByRDD = featureJavaRDD.groupBy(...);
// how to save the groupByRDD results back to HDFS, one file per key?

My question is: what is the best way to achieve this? I suspect the answer may involve Hadoop's saveAsNewAPIHadoopFile or MultipleOutputs.

1 Answer:

Answer 0: (score: 0)

I had a similar use case a few days ago, and I solved it by writing two custom classes: one extending MultipleTextOutputFormat and one implementing RecordWriter.

My input was a JavaPairRDD&lt;String, List&lt;String&gt;&gt;, and I wanted to store it in files named after their keys, each containing all the lines of the value. (So it is almost the same use case.)

Here is the code of my MultipleTextOutputFormat implementation:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.Progressable;

class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return key.toString(); // the return value is used as the file name
    }

    /** The following 4 methods are redeclared only for visibility purposes
        (they are called from the class MyRecordWriter). **/
    protected String generateLeafFileName(String name) {
        return super.generateLeafFileName(name);
    }

    protected V generateActualValue(K key, V value) {
        return super.generateActualValue(key, value);
    }

    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        return super.getInputFileBasedOutputFileName(job, name);
    }

    protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
        return super.getBaseRecordWriter(fs, job, name, arg3);
    }

    /** Use my custom RecordWriter. **/
    @Override
    public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job, String name, final Progressable arg3) throws IOException {
        final String myName = this.generateLeafFileName(name);
        return new MyRecordWriter<K, V>(this, fs, job, arg3, myName);
    }
}

And here is the code of my RecordWriter implementation:

import java.io.IOException;
import java.util.List;
import java.util.TreeMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

class MyRecordWriter<K, V> implements RecordWriter<K, V> {

    private final RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat;
    private final FileSystem fs;
    private final JobConf job;
    private final Progressable arg3;
    private final String myName;

    // one cached base writer per output file path
    private final TreeMap<String, RecordWriter<K, V>> recordWriters = new TreeMap<>();

    MyRecordWriter(RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat, FileSystem fs, JobConf job, Progressable arg3, String myName) {
        this.rddMultipleTextOutputFormat = rddMultipleTextOutputFormat;
        this.fs = fs;
        this.job = job;
        this.arg3 = arg3;
        this.myName = myName;
    }

    @Override
    public void write(K key, V value) throws IOException {
        String keyBasedPath = rddMultipleTextOutputFormat.generateFileNameForKeyValue(key, value, myName);
        String finalPath = rddMultipleTextOutputFormat.getInputFileBasedOutputFileName(job, keyBasedPath);
        Object actualValue = rddMultipleTextOutputFormat.generateActualValue(key, value);
        // lazily create the writer for this key's file on first use
        RecordWriter rw = this.recordWriters.get(finalPath);
        if (rw == null) {
            rw = rddMultipleTextOutputFormat.getBaseRecordWriter(fs, job, finalPath, arg3);
            this.recordWriters.put(finalPath, rw);
        }
        List<String> lines = (List<String>) actualValue;
        for (String line : lines) {
            rw.write(null, line);
        }
    }

    @Override
    public void close(Reporter reporter) throws IOException {
        for (RecordWriter<K, V> rw : this.recordWriters.values()) {
            rw.close(reporter);
        }
        this.recordWriters.clear();
    }
}

Most of this code is exactly the same as what MultipleOutputFormat already does internally. The only difference is these few lines:

List<String> lines = (List<String>) actualValue;
for (String line : lines) {
    rw.write(null, line);
}

These lines let me write every line of the input List&lt;String&gt; to the file. The first argument of the write function is set to null to avoid writing the key on every line.

Finally, all I have to do to write my files is this call:

javaPairRDD.saveAsHadoopFile(path, String.class, List.class, RDDMultipleTextOutputFormat.class);
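The core trick MyRecordWriter relies on, lazily opening one writer per key and caching it so records with the same key land in the same file, can be sketched in plain Java without any Hadoop dependency. The `writeRecords` helper, the `.txt` suffix, and the sample data below are purely illustrative, not part of the original code:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PerKeyWriterDemo {

    // Route each (key, line) record to a file named after its key,
    // opening one writer per key on first use and caching it,
    // just like the TreeMap of RecordWriters in MyRecordWriter.
    static void writeRecords(List<String[]> records, Path dir) throws IOException {
        Map<String, BufferedWriter> writers = new TreeMap<>();
        try {
            for (String[] rec : records) {
                String key = rec[0], line = rec[1];
                BufferedWriter w = writers.get(key);
                if (w == null) { // lazily open one writer per key
                    w = Files.newBufferedWriter(dir.resolve(key + ".txt"));
                    writers.put(key, w);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            // close every cached writer, mirroring close(Reporter)
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("perkey");
        // interleaved keys: both "100" records end up in the same file
        List<String[]> records = Arrays.asList(
                new String[]{"100", "a"},
                new String[]{"200", "c"},
                new String[]{"100", "b"});
        writeRecords(records, dir);
        System.out.println(Files.readAllLines(dir.resolve("100.txt"))); // [a, b]
        System.out.println(Files.readAllLines(dir.resolve("200.txt"))); // [c]
    }
}
```

In the Hadoop version, the `BufferedWriter` role is played by the base `RecordWriter` returned by `getBaseRecordWriter`, and the file name comes from `generateFileNameForKeyValue`.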