I'm trying to write a reducer whose input (key, value) pairs come in the format described below; the reducer's output is also (key, value) pairs. Before I can compute the tf-idf, the formula requires me to know two things, and somehow it seems I have to loop over values twice: once to get the number of files containing the word, and a second time to compute the tf-idf itself.
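Concretely, with the usual definition, for a word w in file d:

tfidf(w, d) = tf(w, d) * log10(N / df(w))

where N is the total number of files and df(w) is the number of files containing w, so I need df(w) (one pass over values) as well as each per-file frequency (another pass) before I can emit anything.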
Pseudocode below:
// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)

        // sum up the number of files containing a particular word

        // for every filename=frequency in the values, compute the tf-idf of
        // this word in filename and output (word@filename, tfidf)
    }
}
I've read that you cannot loop over values twice. One alternative might be to "cache" the values; I tried that, but the results came out inconsistent.
Answer 0 (score: 0)
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)
        Map<String, Integer> tfs = new HashMap<>();
        for (Text value : values) {
            String[] valueParts = value.toString().split("="); // Text has no split(); convert to String first
            tfs.put(valueParts[0], Integer.parseInt(valueParts[1])); // do the necessary checks here
        }

        // set this in the Driver if you know it already, or set a counter
        // in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);

        // DF = number of distinct files containing this word
        double IDF = Math.log10((double) numDocs / tfs.keySet().size());

        // for every filename=frequency in the values, compute the tf-idf of
        // this word in filename and output (word@filename, tfidf)
        for (String file : tfs.keySet()) {
            outputKey.set(key.toString() + "@" + file);
            outputValue.set(String.valueOf(tfs.get(file) * IDF)); // you could also use a DoubleWritable
            context.write(outputKey, outputValue);
        }
    }
}
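For completeness, here is a minimal sketch of setting that value on the driver side; the key name "noOfDocuments" is just the one used above, and totalDocumentCount is a hypothetical variable you'd have computed beforehand:

// in the driver, before submitting the job
Configuration conf = new Configuration();
conf.setInt("noOfDocuments", totalDocumentCount); // hypothetical: however you counted your input files
Job job = Job.getInstance(conf, "tf-idf");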
If you define tf as frequency / maxFrequency, you can find maxFrequency in the first loop and change outputValue accordingly; see the sketch below.
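A minimal, untested sketch of that variant, reusing the tfs map and values from the code above:

// find the highest raw frequency while filling the map
int maxFrequency = 1;
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    int freq = Integer.parseInt(valueParts[1]);
    tfs.put(valueParts[0], freq);
    maxFrequency = Math.max(maxFrequency, freq);
}
// ...then, in the output loop:
double tf = (double) tfs.get(file) / maxFrequency;
outputValue.set(String.valueOf(tf * IDF));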
If you want to try a single-loop solution, you need the IDF up front, which means you need the number of input values first.
In Java 8 you can pull off this trick with:
long DF = values.spliterator().getExactSizeIfKnown();
double IDF = Math.log10((double)numDocs/DF);
as described in this post, or follow the other suggestions in the same post that don't use a loop (otherwise, you can go with the previous answer).
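One caveat worth knowing: getExactSizeIfKnown() returns -1 when the underlying spliterator doesn't report an exact size, which may well be the case for Hadoop's value iterable, so check the result before dividing by it.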
In that case, your code would look like this (I haven't tried it):
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // set this in the Driver if you know it already, or set a counter
        // in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);

        long DF = values.spliterator().getExactSizeIfKnown(); // may be -1 if the size isn't known
        double IDF = Math.log10((double) numDocs / DF);

        // Note: key is a word, values are in the form of
        // (filename=frequency)
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            outputKey.set(key.toString() + "@" + valueParts[0]);
            outputValue.set(String.valueOf(Integer.parseInt(valueParts[1]) * IDF));
            context.write(outputKey, outputValue);
        }
    }
}
This would also save some memory, since you don't need the extra map (if it works).
EDIT: The code above assumes that you already have the total frequency of each word per filename, i.e., that the same filename does not appear more than once among the values, but you may want to check whether that holds. If it doesn't, the second solution won't work, because you'd have to sum up the total frequency per file in a first loop, as sketched below.
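A minimal, untested sketch of that summing variant; it just replaces the put() call in the first loop of the first solution:

// accumulate frequencies in case the same filename appears more than once
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    tfs.merge(valueParts[0], Integer.parseInt(valueParts[1]), Integer::sum);
}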