Question

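How do I use the MultipleOutputs class in a reducer to write multiple outputs, each with its own unique configuration? There is some documentation in the MultipleOutputs javadoc, but it seems limited to Text outputs. It turns out that MultipleOutputs can handle the output path, key class and value class of each output, but attempts to use output formats that need other configuration properties fail.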

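(This question has come up several times, but my attempts to answer it were thwarted because the asker actually had a different problem. Since it took me several days of investigation to find an answer, I am answering my own question here, as suggested by this Meta Stack Overflow question.)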

Answer 1

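I went through the MultipleOutputs implementation and found that it does not support any OutputFormatType that has properties other than outputDir, key class and value class. I tried to write my own MultipleOutputs class, but that failed because it needs to call a private method somewhere in the Hadoop classes.

That left me with a single workaround, which seems to work in all cases and for all combinations of output formats and configurations: write subclasses of the OutputFormat classes I want to use (these turn out to be reusable). These subclasses understand that other OutputFormats are in use at the same time and know how to store away their own properties. The design exploits the fact that an OutputFormat can be configured with the context just before it is asked to provide a RecordWriter.

I have got this working with Cassandra's ColumnFamilyOutputFormat: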

package com.myorg.hadoop.platform;

import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

public abstract class ConcurrentColumnFamilyOutputFormat 
                        extends ColumnFamilyOutputFormat 
                        implements Configurable {

    // Properties that ColumnFamilyOutputFormat reads from the configuration.
    private static final String[] propertyName = {
        "cassandra.output.keyspace",
        "cassandra.output.keyspace.username",
        "cassandra.output.keyspace.passwd",
        "cassandra.output.columnfamily",
        "cassandra.output.predicate",
        "cassandra.output.thrift.port",
        "cassandra.output.thrift.address",
        "cassandra.output.partitioner.class"
    };

    private Configuration configuration;

    public ConcurrentColumnFamilyOutputFormat() {
        super();
    }

    public Configuration getConf() {
        return configuration;
    }

    // Task side: copy the values saved under this output's prefix back to
    // the plain property names before a RecordWriter is requested.
    public void setConf(Configuration conf) {

        configuration = conf;

        String prefix = "multiple.outputs." + getMultiOutputName() + ".";

        for (int i = 0; i < propertyName.length; i++) {
            String property = prefix + propertyName[i];
            String value = conf.get(property);
            if (value != null) {
                conf.set(propertyName[i], value);
            }
        }

    }

    // Job setup side: save the current plain property values under this
    // output's prefix so that configuring another output cannot overwrite them.
    public void configure(Configuration conf) {

        String prefix = "multiple.outputs." + getMultiOutputName() + ".";

        for (int i = 0; i < propertyName.length; i++) {
            String property = prefix + propertyName[i];
            String value = conf.get(propertyName[i]);
            if (value != null) {
                conf.set(property, value);
            }
        }

    }

    // Each concrete subclass returns the name of the named output it serves.
    public abstract String getMultiOutputName();

}
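Stripped of the Hadoop and Cassandra types, the save-and-restore mechanism in the class above can be sketched as follows. This is only an illustration under stated assumptions: a plain `Map` stands in for the Hadoop `Configuration`, and the class name `PrefixedConfigSketch` and the second output name `Weak` are hypothetical, not part of the original code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain Map plays the role of the Hadoop Configuration
// (an assumption made so the sketch runs without Hadoop on the classpath).
final class PrefixedConfigSketch {

    // Copy each listed property to "multiple.outputs.<name>.<property>".
    static void save(Map<String, String> conf, String name, String[] props) {
        String prefix = "multiple.outputs." + name + ".";
        for (String p : props) {
            String value = conf.get(p);
            if (value != null) {
                conf.put(prefix + p, value);
            }
        }
    }

    // Copy each saved property back to its unprefixed key.
    static void restore(Map<String, String> conf, String name, String[] props) {
        String prefix = "multiple.outputs." + name + ".";
        for (String p : props) {
            String value = conf.get(prefix + p);
            if (value != null) {
                conf.put(p, value);
            }
        }
    }

    public static void main(String[] args) {
        String[] props = {"cassandra.output.columnfamily"};
        Map<String, String> conf = new HashMap<>();

        // Job setup: configure the first output, then save its settings.
        conf.put("cassandra.output.columnfamily", "Strong");
        save(conf, "Strong", props);

        // Configuring a second (hypothetical) output overwrites the shared key,
        // but the first output's value survives under its prefix.
        conf.put("cassandra.output.columnfamily", "Weak");
        save(conf, "Weak", props);

        // Task time: an output restores its own values before it is used.
        restore(conf, "Strong", props);
        System.out.println(conf.get("cassandra.output.columnfamily")); // prints "Strong"
    }
}
```

Here `save` corresponds to `configure` and `restore` to `setConf` in the class above.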

For each Cassandra (in this case) output that you want from the reducer, you would have a class:

package com.myorg.multioutput.ReadCrawled;

import com.myorg.hadoop.platform.ConcurrentColumnFamilyOutputFormat;

public class StrongOutputFormat extends ConcurrentColumnFamilyOutputFormat {

    public StrongOutputFormat() {
        super();
    }

    @Override
    public String getMultiOutputName() {
        return "Strong";
    }

}

and you would configure it in your mapper/reducer configuration class:

    // This is how you'd normally configure the ColumnFamilyOutputFormat

ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "Partner", "Strong");
ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setOutputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

    // This is how you tell the MultipleOutput-aware OutputFormat that
    // it's time to save off the configuration so no other OutputFormat
    // steps all over it.

new StrongOutputFormat().configure(job.getConfiguration());

    // This is where we add the MultipleOutput-aware ColumnFamilyOutputFormat
    // to our set of outputs

MultipleOutputs.addNamedOutput(job, "Strong", StrongOutputFormat.class, ByteBuffer.class, List.class);

As another example, the MultipleOutput subclass for FileOutputFormat uses these properties:

    private static String[] propertyName = {
        "mapred.output.compression.type" ,
        "mapred.output.compression.codec" ,
        "mapred.output.compress" ,
        "mapred.output.dir"
        };

and would be implemented just like ConcurrentColumnFamilyOutputFormat above, except that it would use the properties listed here.
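To make the file-output variant concrete, here is a minimal sketch of the same pattern applied to those four properties. It is not the author's actual class: a plain `Map` again stands in for the Hadoop `Configuration`, and the names `ConcurrentFileOutputSketch`, `named`, `Compressed`, `Plain` and the example paths are all hypothetical. A real version would extend a FileOutputFormat subclass and take a `Configuration`, as the Cassandra class does.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map replaces the Hadoop Configuration so this runs
// without Hadoop; all class, output and path names are hypothetical.
abstract class ConcurrentFileOutputSketch {

    private static final String[] PROPERTY_NAMES = {
        "mapred.output.compression.type",
        "mapred.output.compression.codec",
        "mapred.output.compress",
        "mapred.output.dir"
    };

    abstract String getMultiOutputName();

    private String prefix() {
        return "multiple.outputs." + getMultiOutputName() + ".";
    }

    // Job setup: save the current file-output settings under this output's prefix.
    void configure(Map<String, String> conf) {
        for (String p : PROPERTY_NAMES) {
            String value = conf.get(p);
            if (value != null) {
                conf.put(prefix() + p, value);
            }
        }
    }

    // Task time: restore this output's settings to the unprefixed keys.
    void setConf(Map<String, String> conf) {
        for (String p : PROPERTY_NAMES) {
            String value = conf.get(prefix() + p);
            if (value != null) {
                conf.put(p, value);
            }
        }
    }

    // The answer above uses one named subclass per output; an anonymous
    // subclass is used here only to keep the sketch short.
    static ConcurrentFileOutputSketch named(String name) {
        return new ConcurrentFileOutputSketch() {
            @Override
            String getMultiOutputName() {
                return name;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        ConcurrentFileOutputSketch compressed = named("Compressed");
        ConcurrentFileOutputSketch plain = named("Plain");

        conf.put("mapred.output.dir", "/out/compressed");
        conf.put("mapred.output.compress", "true");
        compressed.configure(conf);

        // The second output reuses the same keys with different values.
        conf.put("mapred.output.dir", "/out/plain");
        conf.put("mapred.output.compress", "false");
        plain.configure(conf);

        compressed.setConf(conf);
        System.out.println(conf.get("mapred.output.dir")); // prints "/out/compressed"
    }
}
```

Because every output writes to the same unprefixed keys, the order in which the outputs are configured at job setup does not matter as long as each one is saved before the next is configured.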

Answer 2

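I have implemented MultipleOutputs support for Cassandra (see this JIRA ticket); it is currently scheduled for release in 1.2. If you need it now, you can apply the patch from the ticket. Also see the linked resource on this topic, which gives an example of its usage.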
