Question

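I'm trying to generate Mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input. I get the impression that I can't write to a sequence file from a map-reduce job that uses HBase as its source. Here goes nothing: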

public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);

    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);

    jobConf.setOutputFormat(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }

    FileOutputFormat.setOutputPath(jobConf, outputDir);

    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);

    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);

    job.waitForCompletion(true);
}

I have some test code that runs it, but I get this:

java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
    at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
    at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
    at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
    at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

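So I think my problem is that I'm using import org.apache.hadoop.mapreduce.Job, and its setOutputFormat method expects an instance of org.apache.hadoop.mapreduce.OutputFormat, which is a class. That class has only four implementations, and none of them is for sequence files. Here are its javadocs: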

http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html

If I could, I would use the old-API version of the Job class, but HBase's TableMapReduceUtil only accepts the new-API Job.

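I suppose I could write my results out as text first and then run a second map/reduce job to convert them to sequence files, but that sounds really inefficient.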

There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated.

My mahout jar is version 0.7-cdh4.5.0. My HBase jar is version 0.94.6-cdh4.5.0. All of my hadoop jars are 2.0.0-cdh4.5.0.

Can someone please tell me how to write a SequenceFile from M/R in my situation?

Answer 1

Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat. You have to look beyond the direct subclasses in the javadoc to find it.

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.html

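You have probably imported the wrong (old) one in your driver class. Since you did not include the imports in your code sample, it is impossible to say for sure.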
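To illustrate the point about imports, here is a minimal sketch of a map-only job configured entirely through the new API. It assumes the Hadoop 2.x client jars are on the classpath. The class name NewApiSequenceFileOutput, the configure helper, and the /tmp/house_vectors path are hypothetical, and Text is only a stand-in for Mahout's VectorWritable so that the sketch needs no Mahout dependency. The key difference from the question's code is that the output format is set on the Job with setOutputFormatClass rather than on a JobConf with setOutputFormat, which is the call that sets the old-API property mapred.output.format.class named in the exception.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// New-API classes: note the "mapreduce.lib.output" package, not "mapred".
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class NewApiSequenceFileOutput {

    /** Hypothetical helper: sets up map-only SequenceFile output via the new API only. */
    static Job configure(Configuration conf, Path outputDir) throws IOException {
        Job job = Job.getInstance(conf);

        // Map-only job: mapper output is written straight to the output format.
        job.setNumReduceTasks(0);

        // New-API setter; it does not touch mapred.output.format.class.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class); // stand-in for VectorWritable (assumption)

        FileOutputFormat.setOutputPath(job, outputDir);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Job job = configure(new Configuration(), new Path("/tmp/house_vectors"));
        System.out.println(job.getOutputFormatClass().getName());
    }
}
```

A Job prepared this way can then be handed to TableMapReduceUtil.initTableMapperJob as in the question. Values such as "minmax" would be set on job.getConfiguration() instead of on a separate JobConf.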

Answer 2

For me, this was the missing piece for a similar problem when using Oozie. From braindump:

<!-- New API for map -->
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>

<!-- New API for reducer -->
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>

HBase, Map/Reduce and SequenceFiles: mapred.output.format.class is incompatible with new map API mode
