Java反向索引程序

时间:2014-05-03 20:43:33

标签: java inverted-index treemaps

我正在写一个关于java的反向索引程序,它返回多个文档中的术语频率。我已经能够返回整个集合中出现的单词的次数,但是我无法返回单词出现的文档。这是我到目前为止的代码:

import java.util.*;  // Provides TreeMap, Iterator, Scanner  
import java.io.*;    // Provides FileReader, FileNotFoundException  

public class Run
{
    public static void main(String[ ] args)
    {
        // **THIS CREATES A TREE MAP**  
        TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>( );

        Map[] mapArray = new Map[5];
        mapArray[0] = new HashMap<String, Integer>();

        readWordFile(frequencyData);
        printAllCounts(frequencyData);
    }


    public static int getCount(String word, TreeMap<String, Integer> frequencyData)
    {
        if (frequencyData.containsKey(word))
        {  // The word has occurred before, so get its count from the map  
            return frequencyData.get(word); // Auto-unboxed  
        }
        else
        {  // No occurrences of this word  
            return 0;
        }
    }


    public static void printAllCounts(TreeMap<String, Integer> frequencyData)
    {
        System.out.println("-----------------------------------------------");
        System.out.println("    Occurrences    Word");

        for(String word : frequencyData.keySet( ))
        {
            System.out.printf("%15d    %s\n", frequencyData.get(word), word);
        }

        System.out.println("-----------------------------------------------");
    }


    public static void readWordFile(TreeMap<String, Integer> frequencyData)
    {
        int total = 0;
        Scanner wordFile;
        String word;     // A word read from the file  
        Integer count;   // The number of occurrences of the word
        int counter = 0;
        int docs = 0;

        //**FOR LOOP TO READ THE DOCUMENTS**  
        for(int x=0; x<Docs.length; x++)
        { //start of for loop [*  

            try
            {
                wordFile = new Scanner(new FileReader(Docs[x]));
            }
            catch (FileNotFoundException e)
            {
                System.err.println(e);
                return;
            }

            while (wordFile.hasNext( ))
            {
                // Read the next word and get rid of the end-of-line marker if needed:  
                word = wordFile.next( );

                // This makes the Word lower case.  
                word = word.toLowerCase();

                word = word.replaceAll("[^a-zA-Z0-9\\s]", "");

                // Get the current count of this word, add one, and then store the new count:  
                count = getCount(word, frequencyData) + 1;
                frequencyData.put(word, count);
                total = total + count;
                counter++;
                docs = x + 1;

            }

        } //End of for loop *]  
        System.out.println("There are " + total + " terms in the collection.");
        System.out.println("There are " + counter + " unique terms in the collection.");
        System.out.println("There are " + docs + " documents in the collection.");

    }


    // Array of documents  
    static String Docs [] = {"words.txt", "words2.txt",};

3 个答案:

答案 0 :(得分:2)

不是简单地将单词从单词映射到计数,而是从每个单词创建一个Map到从文档到count的嵌套Map。换句话说:

Map<String, Map<String, Integer>> wordToDocumentMap;

然后,在记录计数的循环中,您希望使用如下所示的代码:

Map<String, Integer> documentToCountMap = wordToDocumentMap.get(currentWord);
if(documentToCountMap == null) {
    // This word has not been found anywhere before,
    // so create a Map to hold document-map counts.
    documentToCountMap = new TreeMap<>();
    wordToDocumentMap.put(currentWord, documentToCountMap);
}
Integer currentCount = documentToCountMap.get(currentDocument);
if(currentCount == null) {
    // This word has not been found in this document before, so
    // set the initial count to zero.
    currentCount = 0;
}
documentToCountMap.put(currentDocument, currentCount + 1);

现在,您要按每个单词和每个文档捕获计数。

完成分析并想要打印结果摘要后,您可以像这样浏览地图:

for(Map.Entry<String, Map<String,Integer>> wordToDocument :
        wordToDocumentMap.entrySet()) {
    String currentWord = wordToDocument.getKey();
    Map<String, Integer> documentToWordCount = wordToDocument.getValue();
    for(Map.Entry<String, Integer> documentToFrequency :
            documentToWordCount.entrySet()) {
        String document = documentToFrequency.getKey();
        Integer wordCount = documentToFrequency.getValue();
        System.out.println("Word " + currentWord + " found " + wordCount +
                " times in document " + document);
    }
}

有关Java中for-each结构的说明,请参阅this tutorial page

要详细了解Map界面的功能,包括entrySet方法,请参阅this tutorial page

答案 1 :(得分:1)

尝试添加第二张地图word -> set of document name,如下所示:

Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();

...
word = word.replaceAll("[^a-zA-Z0-9\\s]", ""); 

// Get the current count of this word, add one, and then store the new count:  
count = getCount(word, frequencyData) + 1;  
frequencyData.put(word, count);
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
    filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
total = total + count;
counter++;
docs = x + 1;

当您需要获取一组遇到特定字词的文件名时,您只需get()来自地图filenames。下面是打印出所有文件名的示例,其中我们遇到了一个单词:

public static void printAllCounts(TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames) {
    System.out.println("-----------------------------------------------");
    System.out.println("    Occurrences    Word");

    for(String word : frequencyData.keySet( ))
    {
        System.out.printf("%15d    %s\n", frequencyData.get(word), word);
        for (String filename : filenames.get(word)) {
            System.out.println(filename);
        } 
    }

    System.out.println("-----------------------------------------------");
}

答案 2 :(得分:0)

我已将扫描仪放入主要方法中,而我搜索的单词将返回单词出现的文档。我还会返回单词出现的次数,但我只能将其作为所有三个文件中的总次数。我想让它返回每个文档中出现的次数。我希望这能够计算tf-idf,如果你对整个tf-idf有一个完整的答案,我将不胜感激。干杯

这是我的代码:

grant_type: password
scope: read write
username: foo
password: bar
client_id: android-client
client_secret: android-secret