MapReduce: iterating over the values to compute tf-idf

Date: 2017-02-08 01:21:16

Tags: java hadoop mapreduce

I am trying to write a reducer whose input (key, value) pairs have the following format:

  • key: word
  • value: file=frequency, where "file" is a file containing the word and "frequency" is the number of times the word appears in that file

The reducer's output (key, value) pairs are:

  • key: word = file
  • value: the tf-idf of that word in that file

Before I can compute the tf-idf, the formula requires me to know two things:

  • the number of files containing the word (i.e. the key)
  • the frequency of the word in each individual file
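
For reference, the standard tf-idf formula with a raw-count tf (which is also what the answer below uses) is:

tf-idf(word, file) = frequency(word, file) * log10(N / numberOfFilesContainingWord)

where N is the total number of documents.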

Somehow it seems that I have to loop over the values twice: once to get the number of files containing the word, and a second time to compute the tf-idf.

Pseudocode below:

// calculate the tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)

        // sum up the number of files containing a particular word

        // for every filename=frequency in the value, compute tf-idf of this
        // word in filename and output (word@filename, tfidf)
    }
}

I have read that it is not possible to loop over the values twice. One alternative might be to "cache" the values; I tried that, but the results came out inconsistent.

1 Answer:

Answer 0 (score: 0)

// calculate the tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of
        // (filename=frequency)
        Map<String, Integer> tfs = new HashMap<>();
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            tfs.put(valueParts[0], Integer.parseInt(valueParts[1])); // do the necessary checks here
        }

        // set "noOfDocuments" in the Driver if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);
        double IDF = Math.log10((double) numDocs / tfs.size());

        // for every filename=frequency in the values, compute the tf-idf of this
        // word in filename and output (word@filename, tfidf)
        for (String file : tfs.keySet()) {
            outputKey.set(key.toString() + "@" + file);
            outputValue.set(String.valueOf(tfs.get(file) * IDF)); // you could also use a DoubleWritable as the output value
            context.write(outputKey, outputValue);
        }
    }
}
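
As noted in the comment above, noOfDocuments has to be set in the Driver. A minimal sketch of that (assuming the total document count is already known when the job is set up; totalNumberOfDocuments is a hypothetical variable here):

// Driver sketch only: pass the total number of documents to the reducers via the
// job configuration, so getConfiguration().getInt("noOfDocuments", 1) picks it up.
Configuration conf = new Configuration();
conf.setInt("noOfDocuments", totalNumberOfDocuments); // hypothetical, precomputed value
Job job = Job.getInstance(conf, "tf-idf");
job.setReducerClass(CalReducer.class);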

If you define tf as frequency / maxFrequency, you can find maxFrequency in the first loop and change the outputValue accordingly.
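
Roughly, the changes to the reducer above could look like this (a sketch only, not part of the original answer; it reuses key, values, IDF, outputKey, outputValue and context from the reduce() method above):

// First loop: fill the map and track the maximum frequency for this word
Map<String, Integer> tfs = new HashMap<>();
int maxFrequency = 0;
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    int frequency = Integer.parseInt(valueParts[1]);
    tfs.put(valueParts[0], frequency);
    maxFrequency = Math.max(maxFrequency, frequency);
}

// Second loop: use the normalized tf = frequency / maxFrequency instead of the raw count
for (String file : tfs.keySet()) {
    double tf = (double) tfs.get(file) / maxFrequency;
    outputKey.set(key.toString() + "@" + file);
    outputValue.set(String.valueOf(tf * IDF));
    context.write(outputKey, outputValue);
}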

If you want to try a single-loop solution, you need the IDF up front, so you need to get the number of input values. In Java 8 you can do that with the following trick:

long DF = values.spliterator().getExactSizeIfKnown();
double IDF = Math.log10((double)numDocs/DF);
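// Caveat (not from the original answer): getExactSizeIfKnown() returns -1 when the
// spliterator cannot determine the number of values up front, so check DF before using it.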

as described in this post, or follow the other suggestions in the same post that do not use a loop (otherwise, you can stick with the solution above).

In that case, your code would be (I have not tried it):

// calculate the tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // set "noOfDocuments" in the Driver if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", 1);
        long DF = values.spliterator().getExactSizeIfKnown();
        double IDF = Math.log10((double) numDocs / DF);

        // Note: key is a word, values are in the form of
        // (filename=frequency)
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            outputKey.set(key.toString() + "@" + valueParts[0]);
            outputValue.set(String.valueOf(Integer.parseInt(valueParts[1]) * IDF));
            context.write(outputKey, outputValue);
        }
    }
}

This would also save some memory, since you would not need the extra HashMap (if it works).

EDIT: The code above assumes that you already have the total frequency of each word per filename, i.e. that the same filename does not appear more than once among the values, but you may want to check whether that actually holds. Otherwise, the second solution does not work, because you would have to sum up the total frequency per file in a first loop anyway.
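
If that assumption does not hold, the first loop of the two-pass solution above could accumulate the counts per file instead (a sketch only, not from the original answer):

// Accumulate the frequencies per file in case the same filename appears more than once
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    tfs.merge(valueParts[0], Integer.parseInt(valueParts[1]), Integer::sum);
}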