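I am trying to pass data between two MapReduce programs using a SequenceFile. The data I want to pass are key/value pairs with a Text key and a MapWritable value. For some reason, some entries in the map do not seem to make it from one program to the other. Here is my code: first the reducer that generates the SequenceFile output, then the mapper that reads from it.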
public static class IntSumReducer extends Reducer<Text, Text, Text, MapWritable> {
    public void reduce(Text key, Iterable<Text> values,
                       Context context
                       ) throws IOException, InterruptedException {
        MapWritable vector = new MapWritable();
        for (Text val : values) {
            if (vector.containsKey(val)) {
                vector.put(val, new IntWritable(((IntWritable) vector.get(val)).get() + 1));
            }
            else
                vector.put(val, new IntWritable(1));
        }
        context.write(key, vector);
    }
}
And the mapper:
public static class TokenizerMapper extends Mapper<Text, MapWritable, Text, MapWritable> {
    private final static int cota = 100;
    private final static double ady = 0.25;
    public void map(Text key, MapWritable value, Context context
                    ) throws IOException, InterruptedException {
        // The count stored under the record's own key is used as the total
        IntWritable tot = (IntWritable) value.get(key);
        int total = tot.get();
        if (total > cota) {
            MapWritable vector = new MapWritable();
            Set<Writable> keys = value.keySet();
            Iterator<Writable> iterator = keys.iterator();
            while (iterator.hasNext()) {
                Text llave = (Text) iterator.next();
                if (!llave.equals(key)) {
                    IntWritable cant = (IntWritable) value.get(llave);
                    // Relative frequency of this entry with respect to the total
                    double rel = (((double) cant.get()) / (double) total);
                    if (cant.get() > cota && rel > ady) {
                        vector.put(llave, new DoubleWritable(rel));
                    }
                }
            }
            context.write(key, vector);
        }
    }
}
Answer 0 (score: 1)
for (Text val : values) {
    if (vector.containsKey(val)) {
        vector.put(val, new IntWritable(((IntWritable) vector.get(val)).get() + 1));
    }
    else
        vector.put(val, new IntWritable(1));
}
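This is your problem: the val Text object is reused by Hadoop, so when calling vector.put you should create a new Text object to break away from the val reference (whose value will change on the next iteration of the for loop).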
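To see why a reused key object corrupts the map, here is a minimal sketch that needs no Hadoop on the classpath. MutableKey is a hypothetical stand-in for Text (mutable, with content-based equals and hashCode), and the single reused instance imitates what the framework does with val. Treat it as an illustration of the aliasing effect under those assumptions, not as Hadoop's actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ReusedKeyDemo {
    // Hypothetical stand-in for Hadoop's Text: mutable, compared by content.
    static final class MutableKey {
        private String value;
        MutableKey(String value) { this.value = value; }
        // Copy constructor, playing the role of new Text(val)
        MutableKey(MutableKey other) { this.value = other.value; }
        void set(String value) { this.value = value; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return value; }
    }

    // Counts inputs while one key instance is overwritten on every iteration.
    // When copyKey is false, that shared instance itself is stored in the map.
    static Map<MutableKey, Integer> count(String[] inputs, boolean copyKey) {
        Map<MutableKey, Integer> counts = new HashMap<>();
        MutableKey reused = new MutableKey("");
        for (String s : inputs) {
            reused.set(s); // imitates the framework refilling the same object
            Integer old = counts.get(reused);
            MutableKey k = copyKey ? new MutableKey(reused) : reused;
            counts.put(k, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Distinct key contents as they would appear when the map is written out.
    static Set<String> keyLabels(Map<MutableKey, Integer> counts) {
        Set<String> labels = new HashSet<>();
        for (MutableKey k : counts.keySet()) {
            labels.add(k.toString());
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "a"};
        System.out.println("shared key: " + count(inputs, false));
        System.out.println("copied key: " + count(inputs, true));
    }
}
```

In the shared-key run, every stored key is the same object, so all entries show whatever value was written to it last. Once such a map is serialized and read back, entries with identical keys overwrite each other, which would explain entries going missing between the two jobs.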
You can change your logic as follows and it should then work (I also updated the counter increment logic to be more efficient):
IntWritable tmpInt;
for (Text val : values) {
    tmpInt = (IntWritable) vector.get(val);
    if (tmpInt == null) {
        tmpInt = new IntWritable(0);
        // create a copy of val Text object
        vector.put(new Text(val), tmpInt);
    }
    // update the IntWritable wrapped int value
    tmpInt.set(tmpInt.get() + 1);
    // Note: you don't need to re-insert the IntWritable into the map
}
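The counting idiom above (one lookup, then mutate the stored counter in place) can be tried outside Hadoop as well. The sketch below assumes a hypothetical Counter class standing in for IntWritable and uses String keys, which are immutable and therefore need no defensive copy, unlike Text.

```java
import java.util.HashMap;
import java.util.Map;

class InPlaceCounterDemo {
    // Hypothetical stand-in for IntWritable: a mutable int holder.
    static final class Counter {
        private int value;
        int get() { return value; }
        void set(int value) { this.value = value; }
    }

    static Map<String, Counter> count(String[] inputs) {
        Map<String, Counter> counts = new HashMap<>();
        for (String s : inputs) {
            Counter c = counts.get(s); // single lookup per input value
            if (c == null) {
                c = new Counter();     // starts at 0
                counts.put(s, c);      // inserted once, on first sight only
            }
            c.set(c.get() + 1);        // the map already holds this reference
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Counter> counts = count(new String[] {"x", "y", "x", "x"});
        System.out.println("x=" + counts.get("x").get() + ", y=" + counts.get("y").get());
    }
}
```

Compared with the containsKey/get/put sequence in the question, this does one map lookup per value instead of up to three and allocates a counter only for the first occurrence of each key.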