Question

I store my word counts in the value field of a HashMap. How can I then get the 500 most frequent words in the text?

 public ArrayList<String> topWords (int numberOfWordsToFind, ArrayList<String> theText) {

        //ArrayList<String> frequentWords = new ArrayList<String>();

        ArrayList<String> topWordsArray= new ArrayList<String>();

        HashMap<String,Integer> frequentWords = new HashMap<String,Integer>();

        int wordCounter=0;

        for (int i=0; i<theText.size();i++){



                  if(frequentWords.containsKey(theText.get(i))){

                       //find value and increment
                      wordCounter=frequentWords.get(theText.get(i));
                      wordCounter++;
                      frequentWords.put(theText.get(i),wordCounter);

                  }

                else {
                  //new word
                  frequentWords.put(theText.get(i),1);

                }
        }


        for (int i=0; i<theText.size();i++){

            if (frequentWords.containsKey(theText.get(i))){
                 // what to write here?
                frequentWords.get(theText.get(i));

            }
        }
        return topWordsArray;
    }

Answer 1

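Another approach you might want to look at is to think about this differently: is a Map really the right conceptual object here? It can be argued that this is a good use for a data structure that is neglected in Java, the bag. A bag is like a set, but it allows an item to appear in the collection multiple times. This simplifies the "add a found word" step considerably.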

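Google's guava-libraries provides a bag structure, although it is called a Multiset there. With a Multiset you can simply call .add() once for each word, even if it is already present. Even easier, though, you can throw away your loop: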

Multiset<String> words = HashMultiset.create(theText);

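Now that you have a Multiset, what do you do with it? Well, you can call entrySet(), which gives you a collection of Multiset.Entry objects. You can then put them into a List (they come in a Set) and sort them with a Comparator. The full code might look like this (using a few other fancy Guava features to show them off):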

Multiset<String> words = HashMultiset.create(theText);

List<Multiset.Entry<String>> wordCounts = Lists.newArrayList(words.entrySet());
Collections.sort(wordCounts, new Comparator<Multiset.Entry<String>>() {
    @Override
    public int compare(Multiset.Entry<String> left, Multiset.Entry<String> right) {
        // Note reversal of 'right' and 'left' to get descending order.
        // getCount() returns a primitive int, so use Integer.compare.
        return Integer.compare(right.getCount(), left.getCount());
    }
});
// wordCounts now contains all the words, sorted by count descending

// Take the first numberOfWordsToFind entries (alternative: use a loop; this is
// simple because it copes easily with fewer elements than requested)
Iterable<Multiset.Entry<String>> firstEntries = Iterables.limit(wordCounts, numberOfWordsToFind);

// Guava-ey alternative: use a Function and Iterables.transform, but in this case
// the 'manual' way is probably simpler:
for (Multiset.Entry<String> entry : firstEntries) {
    topWordsArray.add(entry.getElement());
}

And you're done!

Answer 2

Here you can find a guide on how to sort a HashMap by value. Once it is sorted, you can iterate over the first 500 entries.
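For reference, a minimal sketch of that idea using only the standard library. The class name TopWordsByValue and the alphabetical tie-break are my own assumptions, not something taken from the linked guide:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWordsByValue {

    // Returns up to `limit` words, most frequent first.
    static List<String> topWords(int limit, List<String> text) {
        // Count occurrences; merge() adds 1 to an existing count or starts at 1.
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text) {
            counts.merge(word, 1, Integer::sum);
        }

        // Copy the entries into a list so they can be sorted by value.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Descending count; ties are broken alphabetically (an assumption made
        // here so that the result is deterministic).
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first `limit` keys; copes with fewer entries than requested.
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(topWords(2, text)); // prints [a, b]
    }
}
```

For the question as asked, the call would be topWords(500, theText).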

Answer 3

Take a look at the TreeBidiMap provided by the Apache Commons Collections package: http://commons.apache.org/collections/api-release/org/apache/commons/collections/bidimap/TreeBidiMap.html

It lets you sort the map by either its keys or its values.
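One caveat, as far as I know: a bidirectional map keeps values unique as well as keys, so two words with the same count cannot both stay in it. A rough standard-library alternative (not TreeBidiMap itself) is a TreeMap keyed by count, with each count mapping to the list of words that share it. The class name WordsGroupedByCount and the alphabetical order within a tie are assumptions made for this sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordsGroupedByCount {

    // Returns up to `limit` words, most frequent first.
    static List<String> topWords(int limit, List<String> text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text) {
            counts.merge(word, 1, Integer::sum);
        }

        // Invert the map: count -> words with that count, highest count first.
        TreeMap<Integer, List<String>> byCount =
                new TreeMap<>(Collections.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }

        // Walk the groups from highest count down until `limit` words are collected.
        List<String> result = new ArrayList<>();
        for (List<String> words : byCount.values()) {
            Collections.sort(words); // deterministic order within a tie
            for (String word : words) {
                if (result.size() == limit) {
                    return result;
                }
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("x", "y", "x", "z", "y", "w");
        System.out.println(topWords(3, text)); // prints [x, y, w]
    }
}
```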

Hope it helps.

Zhongxian

Java: getting the 500 most frequent words in a text via a HashMap
