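I have built an inverted index (wordTodocumentQueryMap) for a collection of files. For each word it stores the file number and the word's frequency in that file, like this:
experiment 1:1 17:1 30:1 39:1 52:1 109:2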
*************
empirical 1:1 38:3 58:1 109:1 110:1
*************
flow 1:1 2:6 3:2 4:3 6:1 7:3 9:3 16:1 17:1
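Now I need to run a query (of about 3 words), and the result should be the documents in which all of the query words appear. For example, the result for the query (experiment empirical flow) should be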
1 : 3
where 1 is the document number and 3 is the sum of the frequencies of the query terms in that document.
But my result is:
1 : 3 2 : 6 3 : 2 4 : 3 6 : 1 7 : 3 9 : 3 16 : 1 17 : 2
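The problem is that it lists every file for every word, not only the files that contain all of the words.
Here is my code so far: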
public static TreeMap<Integer, Integer> FileScore=new TreeMap<>();
In main:
for(Map.Entry<String, Map<Integer,Integer>> wordTodocument : wordTodocumentQueryMap.entrySet())
{
    Map<Integer, Integer> documentToFrecuency_value = wordTodocument.getValue();
    for(Map.Entry<Integer, Integer> documentToFrecuency : documentToFrecuency_value.entrySet())
    {
        int documentNo = documentToFrecuency.getKey();
        int wordCount = documentToFrecuency.getValue();
        int score=getScore(documentNo);
        FileScore.put(documentNo, score+wordCount);
    }
}
//print the score
for(Map.Entry<Integer,Integer> FileToScore : FileScore.entrySet())
{
    int documentNo = FileToScore.getKey();
    int Score = FileToScore.getValue();
    System.out.print( documentNo +" : "+ Score+"\t");
}
public static int getScore (int fileno){
    if(FileScore.containsKey(fileno))
        return FileScore.get(fileno);
    return 0;
}
Answer 0 (score: 0)
The following method should do it.
/**
 * Finds documents where all the given words appear.
 *
 * @param wordTodocumentQueryMap For each word maps file no. to frequency > 0
 * @param firstWord the first query word
 * @param otherWords the remaining query words
 * @return A frequency map containing file no. of files containing all of firstWord and otherWords mapped
 *         to a sum of counts for the words.
 */
public static Map<Integer, Integer> docsWithAllWords(Map<String, Map<Integer, Integer>> wordTodocumentQueryMap,
        String firstWord, String... otherWords) {
    // result
    Map<Integer, Integer> fileScore = new TreeMap<>();
    Map<Integer, Integer> firstWordCounts = wordTodocumentQueryMap.get(firstWord);
    if (firstWordCounts == null) { // first word not found in any doc
        // return empty result
        return fileScore;
    }
    outer: for (Map.Entry<Integer, Integer> firstWordCountsEntry : firstWordCounts.entrySet()) {
        Integer docNo = firstWordCountsEntry.getKey();
        int sumOfCounts = firstWordCountsEntry.getValue();
        // find out if both/all other words are in doc, and sum counts
        for (String word : otherWords) {
            Map<Integer, Integer> wordCountEntry = wordTodocumentQueryMap.get(word);
            if (wordCountEntry == null) {
                return fileScore;
            }
            Integer wordCount = wordCountEntry.get(docNo);
            if (wordCount == null) { // word not found in doc
                continue outer;
            }
            sumOfCounts += wordCount;
        }
        fileScore.put(docNo, sumOfCounts);
    }
    return fileScore;
}
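It uses a feature that is rarely used in Java: a label, outer. If you find it too unusual (or simply don't like the continue statement), you can rewrite the method to use a boolean instead. Now you can call
docsWithAllWords(wordTodocumentQueryMap, "experiment", "empirical", "flow")
and it will give you 1 : 3 and nothing else.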
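For anyone who prefers to avoid the label, the boolean-based rewrite mentioned above could look roughly like the sketch below. The class name AllWordsQuery and the helpers sampleIndex and postings are hypothetical and exist only to make the example runnable; the sample data is copied from the question. It is one possible variant, not the only way to write it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class AllWordsQuery {

    /** Hypothetical helper: rebuilds the three sample entries shown in the question. */
    static Map<String, Map<Integer, Integer>> sampleIndex() {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        index.put("experiment", postings(1, 1, 17, 1, 30, 1, 39, 1, 52, 1, 109, 2));
        index.put("empirical", postings(1, 1, 38, 3, 58, 1, 109, 1, 110, 1));
        index.put("flow", postings(1, 1, 2, 6, 3, 2, 4, 3, 6, 1, 7, 3, 9, 3, 16, 1, 17, 1));
        return index;
    }

    /** Hypothetical helper: reads alternating (file no., frequency) pairs into a sorted map. */
    private static Map<Integer, Integer> postings(int... pairs) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            result.put(pairs[i], pairs[i + 1]);
        }
        return result;
    }

    /** Same contract as docsWithAllWords above, but with a boolean flag instead of a label. */
    static Map<Integer, Integer> docsWithAllWords(Map<String, Map<Integer, Integer>> index,
            String firstWord, String... otherWords) {
        Map<Integer, Integer> fileScore = new TreeMap<>();
        Map<Integer, Integer> firstWordCounts = index.get(firstWord);
        if (firstWordCounts == null) { // first word not found in any doc
            return fileScore;
        }
        for (Map.Entry<Integer, Integer> entry : firstWordCounts.entrySet()) {
            int docNo = entry.getKey();
            int sumOfCounts = entry.getValue();
            boolean inAllWords = true;
            for (String word : otherWords) {
                Map<Integer, Integer> counts = index.get(word);
                Integer wordCount = (counts == null) ? null : counts.get(docNo);
                if (wordCount == null) { // word missing from this doc (or from the index)
                    inAllWords = false;
                    break;
                }
                sumOfCounts += wordCount;
            }
            if (inAllWords) {
                fileScore.put(docNo, sumOfCounts);
            }
        }
        return fileScore;
    }

    public static void main(String[] args) {
        // prints {1=3}
        System.out.println(docsWithAllWords(sampleIndex(), "experiment", "empirical", "flow"));
    }
}
```

Note that document 109 contains experiment and empirical but not flow, so it is left out; only documents present in every word's entry survive.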