我正在写一个关于java的反向索引程序,它返回多个文档中的术语频率。我已经能够返回整个集合中出现的单词的次数,但是我无法返回单词出现的文档。这是我到目前为止的代码:
import java.util.*; // Provides TreeMap, Iterator, Scanner
import java.io.*; // Provides FileReader, FileNotFoundException
public class Run
{
public static void main(String[ ] args)
{
// **THIS CREATES A TREE MAP**
TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>( );
Map[] mapArray = new Map[5];
mapArray[0] = new HashMap<String, Integer>();
readWordFile(frequencyData);
printAllCounts(frequencyData);
}
public static int getCount(String word, TreeMap<String, Integer> frequencyData)
{
if (frequencyData.containsKey(word))
{ // The word has occurred before, so get its count from the map
return frequencyData.get(word); // Auto-unboxed
}
else
{ // No occurrences of this word
return 0;
}
}
public static void printAllCounts(TreeMap<String, Integer> frequencyData)
{
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
}
System.out.println("-----------------------------------------------");
}
public static void readWordFile(TreeMap<String, Integer> frequencyData)
{
int total = 0;
Scanner wordFile;
String word; // A word read from the file
Integer count; // The number of occurrences of the word
int counter = 0;
int docs = 0;
//**FOR LOOP TO READ THE DOCUMENTS**
for(int x=0; x<Docs.length; x++)
{ //start of for loop [*
try
{
wordFile = new Scanner(new FileReader(Docs[x]));
}
catch (FileNotFoundException e)
{
System.err.println(e);
return;
}
while (wordFile.hasNext( ))
{
// Read the next word and get rid of the end-of-line marker if needed:
word = wordFile.next( );
// This makes the Word lower case.
word = word.toLowerCase();
word = word.replaceAll("[^a-zA-Z0-9\\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
total = total + count;
counter++;
docs = x + 1;
}
} //End of for loop *]
System.out.println("There are " + total + " terms in the collection.");
System.out.println("There are " + counter + " unique terms in the collection.");
System.out.println("There are " + docs + " documents in the collection.");
}
// Array of documents
static String Docs [] = {"words.txt", "words2.txt",};
答案 0 :(得分:2)
不是简单地将单词从单词映射到计数,而是从每个单词创建一个Map到从文档到count的嵌套Map。换句话说:
Map<String, Map<String, Integer>> wordToDocumentMap;
然后,在记录计数的循环中,您希望使用如下所示的代码:
Map<String, Integer> documentToCountMap = wordToDocumentMap.get(currentWord);
if(documentToCountMap == null) {
// This word has not been found anywhere before,
// so create a Map to hold document-map counts.
documentToCountMap = new TreeMap<>();
wordToDocumentMap.put(currentWord, documentToCountMap);
}
Integer currentCount = documentToCountMap.get(currentDocument);
if(currentCount == null) {
// This word has not been found in this document before, so
// set the initial count to zero.
currentCount = 0;
}
documentToCountMap.put(currentDocument, currentCount + 1);
现在,您要按每个单词和每个文档捕获计数。
完成分析并想要打印结果摘要后,您可以像这样浏览地图:
for(Map.Entry<String, Map<String,Integer>> wordToDocument :
wordToDocumentMap.entrySet()) {
String currentWord = wordToDocument.getKey();
Map<String, Integer> documentToWordCount = wordToDocument.getValue();
for(Map.Entry<String, Integer> documentToFrequency :
documentToWordCount.entrySet()) {
String document = documentToFrequency.getKey();
Integer wordCount = documentToFrequency.getValue();
System.out.println("Word " + currentWord + " found " + wordCount +
" times in document " + document);
}
}
有关Java中for-each结构的说明,请参阅this tutorial page。
要详细了解Map界面的功能,包括entrySet
方法,请参阅this tutorial page。
答案 1 :(得分:1)
尝试添加第二张地图word -> set of document name
,如下所示:
Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
...
word = word.replaceAll("[^a-zA-Z0-9\\s]", "");
// Get the current count of this word, add one, and then store the new count:
count = getCount(word, frequencyData) + 1;
frequencyData.put(word, count);
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
total = total + count;
counter++;
docs = x + 1;
当您需要获取一组遇到特定字词的文件名时,您只需get()
来自地图filenames
。下面是打印出所有文件名的示例,其中我们遇到了一个单词:
public static void printAllCounts(TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames) {
System.out.println("-----------------------------------------------");
System.out.println(" Occurrences Word");
for(String word : frequencyData.keySet( ))
{
System.out.printf("%15d %s\n", frequencyData.get(word), word);
for (String filename : filenames.get(word)) {
System.out.println(filename);
}
}
System.out.println("-----------------------------------------------");
}
答案 2 :(得分:0)
我已将扫描仪放入主要方法中,而我搜索的单词将返回单词出现的文档。我还会返回单词出现的次数,但我只能将其作为所有三个文件中的总次数。我想让它返回每个文档中出现的次数。我希望这能够计算tf-idf,如果你对整个tf-idf有一个完整的答案,我将不胜感激。干杯
这是我的代码:
grant_type: password
scope: read write
username: foo
password: bar
client_id: android-client
client_secret: android-secret