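I am trying to generate Mahout vectors from an HBase table. Mahout requires a sequence file of vectors as its input. I get the impression that I can't write to a sequence file from a map-reduce job that uses HBase as its source. Here goes nothing: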
public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);
    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);
    jobConf.setOutputFormat(SequenceFileOutputFormat.class);
    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }
    FileOutputFormat.setOutputPath(jobConf, outputDir);
    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);
    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);
    job.waitForCompletion(true);
}
I have some test code that runs it, but I get this:
java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
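So I think my problem is that I am using import org.apache.hadoop.mapreduce.Job, and its setOutputFormat method expects an instance of org.apache.hadoop.mapreduce.OutputFormat, which is a class. That class has only four implementations, and none of them is for sequence files. Here are its javadocs: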
http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html
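I would use the old API version of the Job class if I could, but HBase's TableMapReduceUtil only accepts the new API's Job.
I suppose I could write my results out as text first and then run a second map/reduce job to convert that output into a sequence file, but that sounds very inefficient.
There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is marked as deprecated for me.
My Mahout jar is version 0.7-cdh4.5.0, my HBase jar is version 0.94.6-cdh4.5.0, and all of my Hadoop jars are 2.0.0-cdh4.5.0.
Can someone please tell me how to write to a SequenceFile from M/R in my situation?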
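Answer 0 (score: 0)
Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat. You have to look beyond the direct subclasses in the javadoc to find it: org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat extends FileOutputFormat, which in turn extends OutputFormat.
You probably imported the wrong (old) one in your driver class. Since you did not include the imports in your code sample, it is impossible to say for sure.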
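To make the old/new API mismatch concrete, here is a minimal plain-Java sketch of the kind of consistency check that the stack trace points at (Job.setUseNewAPI calling Job.ensureNotSet). It is a simplified model, not the Hadoop source: the class name ApiModeCheckSketch and its methods are hypothetical, a plain Map stands in for the job configuration, and the only details taken from the question are the property name mapred.output.format.class and the error message. The assumption is that calling setOutputFormat on a JobConf records that old-API property, which a map-only job running a new-API mapper then rejects.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Simplified model (NOT Hadoop source) of the old/new API consistency check. */
class ApiModeCheckSketch {

    /** Rejects an old-API property that is present while the new API mode is active. */
    static void ensureNotSet(Map<String, String> conf, String key, String mode)
            throws IOException {
        if (conf.get(key) != null) {
            throw new IOException(key + " is incompatible with " + mode + " mode.");
        }
    }

    /** Assumption: the new map API is in effect unless an old-style mapper is configured. */
    static void checkNewApiMode(Map<String, String> conf, int numReduceTasks)
            throws IOException {
        boolean usesNewMapper = conf.get("mapred.mapper.class") == null;
        if (!usesNewMapper) {
            return; // old-API job: this sketch does not model that branch
        }
        String mode = "new map API";
        ensureNotSet(conf, "mapred.input.format.class", mode);
        if (numReduceTasks == 0) {
            // Map-only job: the mappers write the final output themselves,
            // so an old-API output format cannot be combined with them.
            ensureNotSet(conf, "mapred.output.format.class", mode);
        }
    }

    /** Returns the error message, or null when the configuration is accepted. */
    static String describeProblem(Map<String, String> conf, int numReduceTasks) {
        try {
            checkNewApiMode(conf, numReduceTasks);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // What the old-API JobConf.setOutputFormat(...) call is assumed to record.
        conf.put("mapred.output.format.class",
                "org.apache.hadoop.mapred.SequenceFileOutputFormat");
        // Zero reducers, as in the question's setNumReduceTasks(0).
        System.out.println(describeProblem(conf, 0));
    }
}
```

In the real driver, the usual way to avoid setting the old property at all is to configure the output on the new-API Job instead of the JobConf: import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat and org.apache.hadoop.mapreduce.lib.output.FileOutputFormat, then call job.setOutputFormatClass(SequenceFileOutputFormat.class), job.setNumReduceTasks(0) and FileOutputFormat.setOutputPath(job, outputDir). Whether that is sufficient on the CDH 4.5.0 jars listed in the question has not been verified here.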
Answer 1 (score: 0)
For me, this was the missing piece for a similar problem when using Oozie. From braindump:
<!-- New API for map -->
<property>
<name>mapred.mapper.new-api</name>
<value>true</value>
</property>
<!-- New API for reducer -->
<property>
<name>mapred.reducer.new-api</name>
<value>true</value>
</property>