Finally solved the problem. See the solution at the bottom.
Recently I have been trying to run the recommender example from chapter 6 of Mahout in Action (listings 6.1 to 6.4). I ran into a problem; I have googled it, but I could not find a solution.
The problem is this: I have a pair of mapper and reducer:
public final class WikipediaToItemPrefsMapper extends
    Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    Matcher m = NUMBERS.matcher(line);
    m.find(); // the first number on the line is the user ID
    VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
    VarLongWritable itemID = new VarLongWritable();
    while (m.find()) { // every following number is an item ID for this user
      itemID.set(Long.parseLong(m.group()));
      context.write(userID, itemID);
    }
  }
}
public class WikipediaToUserVectorReducer extends
    Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable> {

  @Override
  public void reduce(VarLongWritable userID,
      Iterable<VarLongWritable> itemPrefs, Context context)
      throws IOException, InterruptedException {
    // Collect all of the user's item IDs into one sparse preference vector
    Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
    for (VarLongWritable itemPref : itemPrefs) {
      userVector.set((int) itemPref.get(), 1.0f);
    }
    context.write(userID, new VectorWritable(userVector));
  }
}
The reducer outputs a userID and a userVector, which looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}
Then I want to process this data with another pair of mapper and reducer:
public class UserVectorSplitterMapper extends
    Mapper<VarLongWritable, VectorWritable, IntWritable, VectorOrPrefWritable> {

  public void map(VarLongWritable key, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    long userID = key.get();
    Vector userVector = value.get();
    // Emit one <itemIndex, (userID, preference)> pair per non-zero vector element
    Iterator<Vector.Element> it = userVector.iterateNonZero();
    IntWritable itemIndexWritable = new IntWritable();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      int itemIndex = e.index();
      float preferenceValue = (float) e.get();
      itemIndexWritable.set(itemIndex);
      context.write(itemIndexWritable,
          new VectorOrPrefWritable(userID, preferenceValue));
    }
  }
}
When I try to run the job, it throws the error:
org.apache.hadoop.io.Text cannot be cast to org.apache.mahout.math.VectorWritable
The first mapper-reducer writes its output to HDFS, and the second mapper-reducer tries to read that output. The mapper can cast 98955 to VarLongWritable, but it cannot convert {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0} to VectorWritable. So I am wondering: is there a way to make the first mapper-reducer send its output directly to the second pair, so that no data conversion is needed? I have looked through Hadoop in Action and Hadoop: The Definitive Guide, and there does not seem to be a way to do this. Any suggestions?
SOLUTION: By using SequenceFileOutputFormat, we can output and save the reduce result of the first MapReduce workflow on the DFS. The second MapReduce workflow can then read that temporary file as its input by passing the SequenceFileInputFormat class as a parameter when the mapper is created. Since the vector is saved in a binary sequence file with a specific format, SequenceFileInputFormat can read it and convert it back to vector form.
Here is some sample code:
confFactory ToItemPrefsWorkFlow = new confFactory(
    new Path("/dbout"),                // input file path
    new Path("/mahout/output.txt"),    // output file path
    TextInputFormat.class,             // input format
    VarLongWritable.class,             // mapper key format
    Item_Score_Writable.class,         // mapper value format
    VarLongWritable.class,             // reducer key format
    VectorWritable.class,              // reducer value format
    SequenceFileOutputFormat.class     // the reducer output format
);
ToItemPrefsWorkFlow.setMapper(WikipediaToItemPrefsMapper.class);
ToItemPrefsWorkFlow.setReducer(WikipediaToUserVectorReducer.class);
JobConf conf1 = ToItemPrefsWorkFlow.getConf();

confFactory UserVectorToCooccurrenceWorkFlow = new confFactory(
    new Path("/mahout/output.txt"),
    new Path("/mahout/UserVectorToCooccurrence"),
    SequenceFileInputFormat.class,     // note that the mapper input format of the second workflow is now SequenceFileInputFormat.class
    //UserVectorToCooccurrenceMapper.class,
    IntWritable.class,
    IntWritable.class,
    IntWritable.class,
    VectorWritable.class,
    SequenceFileOutputFormat.class
);
UserVectorToCooccurrenceWorkFlow.setMapper(UserVectorToCooccurrenceMapper.class);
UserVectorToCooccurrenceWorkFlow.setReducer(UserVectorToCooccurrenceReducer.class);
JobConf conf2 = UserVectorToCooccurrenceWorkFlow.getConf();

JobClient.runJob(conf1);
JobClient.runJob(conf2);
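Note that confFactory and Item_Score_Writable in the snippet above appear to be custom helper classes that are not shown here. As a rough sketch only, the same two-job chain can be written with the plain Hadoop new-API Job class along the following lines; the HDFS paths, the driver class name, and the map-only setup of the second job are illustrative assumptions, and the exact Mahout package of VectorOrPrefWritable may differ between versions:
// Sketch only: assumes the mapper/reducer classes from the question are in the
// same package, and that the /tmp/... paths are placeholders you would replace.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.cf.taste.hadoop.item.VectorOrPrefWritable; // package may vary across Mahout versions
import org.apache.mahout.math.VarLongWritable;
import org.apache.mahout.math.VectorWritable;

public class ChainedJobsDriver {   // hypothetical driver class name
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Job 1: text lines -> <VarLongWritable, VectorWritable>, saved as a sequence file
    Job job1 = new Job(conf, "wikipedia to user vectors"); // Job.getInstance(conf, ...) on newer Hadoop
    job1.setJarByClass(WikipediaToItemPrefsMapper.class);
    job1.setMapperClass(WikipediaToItemPrefsMapper.class);
    job1.setReducerClass(WikipediaToUserVectorReducer.class);
    job1.setMapOutputKeyClass(VarLongWritable.class);
    job1.setMapOutputValueClass(VarLongWritable.class);
    job1.setOutputKeyClass(VarLongWritable.class);
    job1.setOutputValueClass(VectorWritable.class);
    job1.setOutputFormatClass(SequenceFileOutputFormat.class); // keeps the writables in binary form
    FileInputFormat.addInputPath(job1, new Path("/dbout"));             // placeholder input path
    FileOutputFormat.setOutputPath(job1, new Path("/tmp/userVectors")); // placeholder intermediate path
    if (!job1.waitForCompletion(true)) {
      System.exit(1);
    }

    // Job 2: read the sequence file back as <VarLongWritable, VectorWritable>
    Job job2 = new Job(conf, "split user vectors");
    job2.setJarByClass(UserVectorSplitterMapper.class);
    job2.setMapperClass(UserVectorSplitterMapper.class);
    job2.setNumReduceTasks(0);                               // map-only; the question shows no second reducer
    job2.setInputFormatClass(SequenceFileInputFormat.class); // matches the output format of job 1
    job2.setOutputKeyClass(IntWritable.class);
    job2.setOutputValueClass(VectorOrPrefWritable.class);
    job2.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job2, new Path("/tmp/userVectors"));
    FileOutputFormat.setOutputPath(job2, new Path("/tmp/itemVectors")); // placeholder output path
    System.exit(job2.waitForCompletion(true) ? 0 : 1);
  }
}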
If you have any questions about this, please feel free to contact me.
Answer 0 (score: 4)
You need to explicitly configure the output of the first job to use SequenceFileOutputFormat and define the output key and value classes:
job.setOutputFormat(SequenceFileOutputFormat.class);
job.setOutputKeyClass(VarLongWritable.class);
job.setOutputValueClass(VectorWritable.class);
Without seeing your driver code, my guess is that you are using TextOutputFormat for the output of the first job and TextInputFormat for the input of the second job; that input format delivers <Text, Text> pairs to the second mapper.
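Since the mapper and reducer classes in the question extend the newer org.apache.hadoop.mapreduce API, a rough new-API equivalent of these calls (assuming Job objects named job1 and job2) would be:
// New-API (org.apache.hadoop.mapreduce.Job) equivalents; variable names are illustrative
job1.setOutputFormatClass(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(VarLongWritable.class);
job1.setOutputValueClass(VectorWritable.class);

// and on the second job, match it on the input side:
job2.setInputFormatClass(SequenceFileInputFormat.class);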
Answer 1 (score: 1)
I am a beginner at hadoop, so this is only my guess at an answer; please bear with me, or point it out if it seems naive.
I think it is not reasonable to send data from a reducer to the next mapper without saving it on HDFS, because the decision of which data split goes to which mapper is designed to satisfy the locality criterion (a mapper runs on a node that stores its data locally).
If you do not store the data on HDFS, it is very likely that all of it will be transmitted over the network, which may cause bandwidth problems.
Answer 2 (score: 0)
You have to temporarily save the output of the first map-reduce so that the second one can use it.
This might help you understand how the output of the first map-reduce is passed to the second. (It is based on Generator.java of Apache Nutch.)
This is the first map-reduce. The temporary directory for the output:
Path tempDir =
    new Path(getConf().get("mapred.temp.dir", ".")
        + "/job1-temp-"
        + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
Set up the first map-reduce job:
JobConf job1 = new JobConf(getConf());
job1.setJobName("job 1");
FileInputFormat.addInputPath(...);
job1.setMapperClass(...);
FileOutputFormat.setOutputPath(job1, tempDir);
job1.setOutputFormat(SequenceFileOutputFormat.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(...);
JobClient.runJob(job1);
Note that the output directory of the first job is set in the job configuration. Use it in the second job:
JobConf job2 = new JobConf(getConf());
FileInputFormat.addInputPath(job2, tempDir);
job2.setReducerClass(...);
JobClient.runJob(job2);
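A small note beyond the original answer: because tempDir is written with SequenceFileOutputFormat, the second job also needs the matching input format set, for example:
// Match the input format of job 2 to the output format of job 1 (old mapred API, as above)
job2.setInputFormat(SequenceFileInputFormat.class);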
Remember to clean up the temporary directory when you are done:
// clean up
FileSystem fs = FileSystem.get(getConf());
fs.delete(tempDir, true);
Hope this helps.