How to find the documents in which all 3 query words appear (Java)

Time: 2016-10-05 07:12:34

Tags: java treemap information-retrieval

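I have built an inverted index (wordTodocumentQueryMap) for a collection of files. For each word it stores the file number and the word's frequency in that file,

like this: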

experiment      1:1     17:1    30:1    39:1    52:1    109:2
*************
empirical       1:1     38:3    58:1    109:1   110:1   
*************
flow            1:1     2:6     3:2     4:3     6:1     7:3     9:3     16:1    17:1

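Now I need to run a query (of about 3 words), and the result should be the documents in which all of the query words appear. For the query (experiment empirical flow) the result should be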

1 : 3 

where 1 is the document number and 3 is the sum of the term frequencies of the query words in that document.

But my result is:

1 : 3   2 : 6   3 : 2   4 : 3   6 : 1   7 : 3   9 : 3   16 : 1  17 : 2  

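The problem is that it lists every document of every word, not only the documents that contain all of the words.

Here is my code so far: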

public static TreeMap<Integer, Integer> FileScore=new TreeMap<>();

and, inside a method:

for (Map.Entry<String, Map<Integer, Integer>> wordTodocument : wordTodocumentQueryMap.entrySet()) {
    Map<Integer, Integer> documentToFrecuency_value = wordTodocument.getValue();
    for (Map.Entry<Integer, Integer> documentToFrecuency : documentToFrecuency_value.entrySet()) {
        int documentNo = documentToFrecuency.getKey();
        int wordCount = documentToFrecuency.getValue();
        int score = getScore(documentNo);

        FileScore.put(documentNo, score + wordCount);
    }
}

//print the score

for (Map.Entry<Integer, Integer> FileToScore : FileScore.entrySet()) {
    int documentNo = FileToScore.getKey();
    int Score = FileToScore.getValue();
    System.out.print(documentNo + " : " + Score + "\t");
}


public static int getScore(int fileno) {
    if (FileScore.containsKey(fileno))
        return FileScore.get(fileno);
    return 0;
}

1 answer:

Answer 0: (score: 0)

The following method should do it.

/**
 * Finds documents where all the given words appear.
 * 
 * @param wordTodocumentQueryMap For each word, maps file no. to frequency > 0
 * @param firstWord the first query word
 * @param otherWords the remaining query words
 * @return A frequency map containing file no. of files containing all of firstWord and otherWords mapped
 *         to a sum of counts for the words.
 */
public static Map<Integer, Integer> docsWithAllWords(Map<String, Map<Integer, Integer>> wordTodocumentQueryMap,
        String firstWord, String... otherWords) {
    // result
    Map<Integer, Integer> fileScore = new TreeMap<>();
    Map<Integer, Integer> firstWordCounts = wordTodocumentQueryMap.get(firstWord);
    if (firstWordCounts == null) { // first word not found in any doc
        // return empty result
        return fileScore;
    }
    outer: for (Map.Entry<Integer, Integer> firstWordCountsEntry : firstWordCounts.entrySet()) {
        Integer docNo = firstWordCountsEntry.getKey();
        int sumOfCounts = firstWordCountsEntry.getValue();
        // find out if both/all other words are in doc, and sum counts
        for (String word : otherWords) {
            Map<Integer, Integer> wordCounts = wordTodocumentQueryMap.get(word);
            if (wordCounts == null) { // word not found in any doc, so no doc can contain all words
                return fileScore;
            }
            Integer wordCount = wordCounts.get(docNo);
            if (wordCount == null) { // word not found in doc
                continue outer;
            }
            sumOfCounts += wordCount;
        }
        fileScore.put(docNo, sumOfCounts);
    }
    return fileScore;
}

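It uses a feature that is rarely seen in Java: the label outer. If you find that too unusual (or simply dislike continue statements), you can rewrite the loop to use a boolean instead. Now you can call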

docsWithAllWords(wordTodocumentQueryMap, "experiment", "empirical", "flow")

and it will give you 1 : 3 and nothing else.
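For readers who prefer to avoid the label, here is a minimal sketch of the boolean rewrite mentioned above, run against the sample postings from the question. The class name AllWordsQuery, the method name docsWithAllWordsNoLabel and the helpers postings and sampleIndex are made up for this illustration and are not part of the original code. One small difference, assumed to be acceptable: a query word that is missing from the index is handled inside the loop rather than by an early return, which gives the same empty result.

```java
import java.util.Map;
import java.util.TreeMap;

class AllWordsQuery {

    /** Same result as docsWithAllWords, but with a boolean flag instead of a labeled continue. */
    static Map<Integer, Integer> docsWithAllWordsNoLabel(
            Map<String, Map<Integer, Integer>> index, String firstWord, String... otherWords) {
        Map<Integer, Integer> fileScore = new TreeMap<>();
        Map<Integer, Integer> firstWordCounts = index.get(firstWord);
        if (firstWordCounts == null) {
            return fileScore; // first word occurs in no document
        }
        for (Map.Entry<Integer, Integer> entry : firstWordCounts.entrySet()) {
            Integer docNo = entry.getKey();
            int sumOfCounts = entry.getValue();
            boolean inAllWords = true;
            for (String word : otherWords) {
                Map<Integer, Integer> counts = index.get(word);
                // null if the word is unknown or does not occur in this document
                Integer wordCount = (counts == null) ? null : counts.get(docNo);
                if (wordCount == null) {
                    inAllWords = false;
                    break;
                }
                sumOfCounts += wordCount;
            }
            if (inAllWords) {
                fileScore.put(docNo, sumOfCounts);
            }
        }
        return fileScore;
    }

    /** Hypothetical helper: builds a posting list from alternating (docNo, frequency) values. */
    static Map<Integer, Integer> postings(int... docThenFreq) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (int i = 0; i < docThenFreq.length; i += 2) {
            result.put(docThenFreq[i], docThenFreq[i + 1]);
        }
        return result;
    }

    /** Hypothetical helper: the three posting lists shown in the question. */
    static Map<String, Map<Integer, Integer>> sampleIndex() {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        index.put("experiment", postings(1, 1, 17, 1, 30, 1, 39, 1, 52, 1, 109, 2));
        index.put("empirical", postings(1, 1, 38, 3, 58, 1, 109, 1, 110, 1));
        index.put("flow", postings(1, 1, 2, 6, 3, 2, 4, 3, 6, 1, 7, 3, 9, 3, 16, 1, 17, 1));
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> result =
                docsWithAllWordsNoLabel(sampleIndex(), "experiment", "empirical", "flow");
        // Print in the same "docNo : score" format as the question.
        for (Map.Entry<Integer, Integer> e : result.entrySet()) {
            System.out.print(e.getKey() + " : " + e.getValue() + "\t");
        }
        System.out.println();
    }
}
```

With the sample data only document 1 contains all three words, so this prints 1 : 3. Documents 17 and 109 each contain only two of the three words and are dropped.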