String frequency search does not find all words

Time: 2018-11-27 23:22:46

Tags: java string text frequency word-frequency

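I am trying to implement a string frequency search algorithm that parses a jokes.txt file and gets the number of occurrences of each unique word in the text. The algorithm should be case sensitive, so that "a" and "A" are counted as two distinct words. As of now, the algorithm seems to skip the first occurrence of "a" in the text, and many other words after it.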

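Also, the words list does contain every word in the text. Somehow the loop inside the (!isDuplicate) condition is never reached for "a", so count is not incremented for it.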

jokes.txt

I wondered why the baseball was getting bigger.
Then it hit me.

Police were called to a day care
where a 3-yr-old was resisting a rest.
...

WordCounter.java

import java.util.*;
import java.io.FileNotFoundException;
import java.io.FileInputStream;

public class WordCounter {
    ArrayList<String> words = new ArrayList<String>();

    //prints number of words in the  file
    public void numOfWords(Scanner key1) {
        int counter = 1;
        while(key1.hasNext()) {
            words.add(key1.next().replaceAll("[^a-zA-Z]", ""));

        }
    }

    //Takes word as parameter and returns frequency of that word
    public void frequencyCounter(Scanner key1) {
        ArrayList <String> freqWords = new ArrayList<String>();
        int count = 1;
        int counter = 1;

        for(int i = 0; i < words.size(); i++){
            boolean isDuplicate = false;
            for (String s: freqWords){
                if (s.contains(words.get(i).trim()))
                    isDuplicate =true;
            }

            if (!isDuplicate){

                for(int j = i + 1; j < words.size(); j++){
                    if(words.get(i).equals(words.get(j))){
                        count++;
                    }
                }
                freqWords.add(count + "-" + words.get(i));
                Collections.sort(freqWords, Collections.reverseOrder());
                count = 1;     
            }
        }

        for(int i = 0; i < freqWords.size(); i++) {
            System.out.print((i+1) + "       ");
            System.out.println(freqWords.get(i));
        }
    }

}

2 Answers:

Answer 0 (score: 2)

Your logic for determining duplicates is slightly incorrect:

        boolean isDuplicate = false;
        for (String s: freqWords){
            if (s.contains(words.get(i).trim()))
                isDuplicate =true;
        }

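If words.get(i) is "a" and s holds the word "apple" (the entries in freqWords have the form count-word, e.g. "1-apple"), this sets isDuplicate to true, because "apple" contains "a". Instead, check whether the word stored in s matches words.get(i) exactly.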
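As a rough sketch of what an exact-match check could look like: the class name DuplicateCheckSketch and the helpers wordOf and isDuplicate below are made up for illustration and are not part of the question's code. It assumes every entry keeps the count + "-" + word format that frequencyCounter builds, and that the words themselves contain no hyphen (which holds after the replaceAll in numOfWords).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original WordCounter.
class DuplicateCheckSketch {

    // Extracts the word from an entry of the form "count-word",
    // e.g. "3-apple" -> "apple". Assumes the count contains no '-'.
    static String wordOf(String entry) {
        return entry.substring(entry.indexOf('-') + 1);
    }

    // Returns true only if some entry stores exactly this word.
    static boolean isDuplicate(List<String> freqWords, String word) {
        for (String entry : freqWords) {
            if (wordOf(entry).equals(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> freqWords = new ArrayList<>();
        freqWords.add("1-baseball");
        freqWords.add("1-was");

        // Substring test: "a" occurs inside "1-baseball".
        System.out.println("1-baseball".contains("a"));   // prints true
        // Exact-match test: "a" itself has not been stored yet.
        System.out.println(isDuplicate(freqWords, "a"));   // prints false
        System.out.println(isDuplicate(freqWords, "was")); // prints true
    }
}
```

With a check like this, "a" would no longer be treated as already counted just because an earlier word happens to contain that letter.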

Answer 1 (score: 0)

Just editing my earlier, incorrect answer:

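The problem is probably contains(), since the API says it searches the string for a CharSequence. That means you are essentially searching every stored word for the CharSequence "a" and then flagging "a" as a duplicate. So a word like "day" counts as a match, because you are searching for "a".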

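I think it would be better, and faster, to use a HashMap to find duplicates. You can use the word as the key and keep the number of occurrences as the value.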
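A minimal sketch of the HashMap idea, assuming the words have already been read into a list the way numOfWords does. The class name WordFrequencySketch and the method countWords are invented for this example, and the sample words are taken from the jokes.txt excerpt in the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class, shown only to illustrate counting with a map.
class WordFrequencySketch {

    // Maps each distinct word to the number of times it occurs.
    // Keys are compared with equals(), so "a" and "A" stay separate.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            // Insert 1 for a new word, otherwise add 1 to the stored count.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "Police", "were", "called", "to", "a", "day", "care",
                "where", "a", "yrold", "was", "resisting", "a", "rest");

        Map<String, Integer> counts = countWords(words);
        System.out.println(counts.get("a"));   // prints 3
        System.out.println(counts.get("day")); // prints 1
        System.out.println(counts.get("A"));   // prints null ("A" never occurs)
    }
}
```

Because the map looks up whole keys, there is no substring matching involved, and each word is handled in a single pass instead of the nested loops.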