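Word frequency percentage in Java

Time: 2015-05-05 12:42:29

Tags: java

I have to write a program that processes word frequencies in a linked list and outputs results like this: word, number of occurrences, frequency percentage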

import java.io.File;
import java.io.FileNotFoundException;
import java.util.*;

public class Link {

    public static void main(String args[]) {

        long start = System.currentTimeMillis();

        LinkedList<String> list = new LinkedList<String>();

        File file = new File("words.txt");

        try {

            Scanner sc = new Scanner(file);

            String words;

            while (sc.hasNext()) {
                words = sc.next();
                words = words.replaceAll("[^a-zA-Z0-9]", "");
                words = words.toLowerCase();
                words = words.trim();
                list.add(words);
            }

            sc.close();

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        Map<String, Integer> frequency = new TreeMap<String, Integer>();

        for (String count : list) {
            if (frequency.containsKey(count)) {
                frequency.put(count, frequency.get(count) + 1);
            } else {
                frequency.put(count, 1);
            }
        }

        System.out.println(frequency);

        long end = System.currentTimeMillis();

        System.out.println("\n" + "Duration: " + (end - start) + " ms");
    }
}

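Output: {a=1, ab=3, abbc=1, asd=2, xyz=1}

What I don't know is how to express the frequency as a percentage and how to ignore words shorter than 2 characters. For example, "a=1" should be ignored.

Thanks in advance.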

3 answers:

Answer 0: (score: 4)

First, introduce a double variable to keep track of the total number of occurrences, e.g.:

double total = 0;

Next, filter out any String with length() < 2. You can do this before adding it to the LinkedList.

while (sc.hasNext()) {
    words = sc.next();
    words = words.replaceAll("[^a-zA-Z0-9]", "");
    words = words.toLowerCase();
    words = words.trim();
    if (words.length() >= 2) list.add(words); //Filter out strings < 2 chars
}
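
As a side effect, this length check also drops the empty strings that tokens consisting only of punctuation (for example "--") turn into after replaceAll. A minimal sketch of the cleaning and filtering step in isolation follows; the class name WordCleaner and its method names are made up for illustration and do not appear in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class WordCleaner {

    // Strip everything except ASCII letters and digits, then lower-case.
    static String normalize(String token) {
        return token.replaceAll("[^a-zA-Z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Keep only the normalized tokens that are at least 2 characters long.
    static List<String> clean(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String word = normalize(token);
            if (word.length() >= 2) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample tokens, as a Scanner might return them
        List<String> tokens = Arrays.asList("Ab,", "a", "--", "XYZ!", "ab");
        System.out.println(clean(tokens)); // prints [ab, xyz, ab]
    }
}
```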

Now, while looping over the Strings, we should increase the total variable by 1 for every occurrence:

for (String count : list) {
    if (frequency.containsKey(count)) {
        frequency.put(count, frequency.get(count) + 1);
    } else {
        frequency.put(count, 1);
    }
    total++; // Increase the total number of occurrences
}

Then we can print it out using System.out.printf().

for (Map.Entry<String, Integer> entry: frequency.entrySet()) {
    System.out.printf("String: %s \t Occurrences: %d \t Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue()/total*100);
}
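
The reason total is a double: entry.getValue() is an Integer, and dividing it by an int would be integer division, which truncates to 0 whenever the count is smaller than the total. A small sketch of the difference, using the count 3 out of 7 words that "ab" has in the question's data once "a" is filtered out; the class and method names are invented for this illustration.

```java
import java.util.Locale;

class PercentageDemo {

    // Integer division runs first, so any count below total yields 0.
    static int truncatedPercentage(int count, int total) {
        return count / total * 100;
    }

    // With a double total the division is done in floating point.
    static double percentage(int count, double total) {
        return count / total * 100;
    }

    public static void main(String[] args) {
        System.out.println(truncatedPercentage(3, 7));                // prints 0
        System.out.printf(Locale.ROOT, "%.2f%%%n", percentage(3, 7)); // prints 42.86%
    }
}
```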



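Note that this (the printf statement) will not look nice once you are dealing with long Strings or large occurrence counts. So, optionally, you can do the following, where maxLength holds the largest length() of any String in your list and occLength holds the number of digits of the largest occurrence count.

for (Map.Entry<String, Integer> entry: frequency.entrySet()) {
    System.out.printf("String: %" + maxLength + "s  Occurrences: %" + occLength + "d  Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue()/total*100);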
}
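
The answer does not show how maxLength and occLength are obtained. One possible way, sketched here with a hypothetical helper class ColumnWidths and made-up sample counts, is a single pass over the keys and values of the map:

```java
import java.util.Map;
import java.util.TreeMap;

class ColumnWidths {

    // Length of the longest key in the map (0 for an empty map).
    static int maxKeyLength(Map<String, Integer> frequency) {
        int max = 0;
        for (String word : frequency.keySet()) {
            max = Math.max(max, word.length());
        }
        return max;
    }

    // Number of decimal digits of the largest count (1 for an empty map).
    static int maxCountDigits(Map<String, Integer> frequency) {
        int max = 0;
        for (int count : frequency.values()) {
            max = Math.max(max, count);
        }
        return String.valueOf(max).length();
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the question
        Map<String, Integer> frequency = new TreeMap<>();
        frequency.put("ab", 3);
        frequency.put("abbc", 1);
        frequency.put("asd", 12);
        double total = 16;

        int maxLength = maxKeyLength(frequency);
        int occLength = maxCountDigits(frequency);
        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            System.out.printf("String: %" + maxLength + "s  Occurrences: %" + occLength
                    + "d  Percentage: %.2f%%%n",
                    entry.getKey(), entry.getValue(), entry.getValue() / total * 100);
        }
    }
}
```

Both widths are only used inside the loop, so an empty map never produces an invalid "%0s" format.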



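Answer 1: (score: 1)

Ignore strings shorter than 2 characters in the step that adds to the map, and maintain a counter of legal words to compute the percentage.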

int legalWords = 0; // number of words with at least 2 characters
for (String count : list) {
    if (count.length() >= 2) { // skip words shorter than 2 characters
        if (frequency.containsKey(count)) {
            frequency.put(count, frequency.get(count) + 1);
        } else {
            frequency.put(count, 1);
        }
        legalWords++;
    }
}
// Print each word, its count and its share of the legal words
for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue() + " " + (entry.getValue() / (double) legalWords) * 100.0 + "%");
}
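
On Java 8 or later, the containsKey/put branch can be collapsed with Map.merge. This is only a sketch of another way to write the same loop, not part of the answer above; the class and method names are placeholders, and the sample list simply reproduces the counts from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class MergeCount {

    // Count every word of length >= 2; shorter words are skipped entirely.
    static Map<String, Integer> countLegalWords(List<String> list) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String word : list) {
            if (word.length() >= 2) {
                frequency.merge(word, 1, Integer::sum); // insert 1 or add 1 to the old value
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("a", "ab", "asd", "ab", "xyz", "abbc", "asd", "ab");
        Map<String, Integer> frequency = countLegalWords(list);

        // The number of legal words is the sum of all counts.
        int legalWords = 0;
        for (int count : frequency.values()) {
            legalWords += count;
        }

        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            System.out.printf(Locale.ROOT, "%s %d %.2f%%%n", entry.getKey(), entry.getValue(),
                    entry.getValue() / (double) legalWords * 100.0);
        }
    }
}
```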

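Answer 2: (score: 0)

Note: since the OP's question does not give us the details, we assume that one-character words are counted but not output.

Separate your logic from your main class: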

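import java.util.List;

class WordStatistics {
    private String word;
    private long occurrences;
    private float frequency;

    public WordStatistics(String word) {
        this.word = word;
    }

    // Count how often this word appears in the list, ignoring case
    public WordStatistics calculateOccurrences(List<String> words) {
        this.occurrences = words.stream()
                .filter(p -> p.equalsIgnoreCase(this.word)).count();
        return this;
    }

    // Percentage of the whole list; call calculateOccurrences first
    public WordStatistics calculateFrequency(List<String> words) {
        this.frequency = (float) this.occurrences / words.size() * 100;
        return this;
    }

    // getters
    public String getWord() { return word; }
    public long getOccurrences() { return occurrences; }
    public float getFrequency() { return frequency; }

}

Explanation

Consider this list of words: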

List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
        "C", "Java", "C#", "C#","Java","R");

Use the Java 8 Streams API to count the occurrences of word in words:

   words.stream()
            .filter(p -> p.equalsIgnoreCase(word)).count();

Compute the frequency percentage of the word:

  frequency = (float) occurrences / words.size() * 100;

Set up your words' statistics (occurrences + frequency):

List<WordStatistics> wordsStatistics = new LinkedList<WordStatistics>();

    words.stream()
            .distinct()
            .forEach(
                    word -> wordsStatistics.add(new WordStatistics(word)
                            .calculateOccurrences(words)
                            .calculateFrequency(words)));

Output the result, ignoring one-character words:

    wordsStatistics
            .stream()
            .filter(word -> word.getWord().length() > 1)
            .forEach(
                    word -> System.out.printf("Word : %s \t"
                            + "Occurrences : %d \t"
                            + "Frequency : %.2f%% \t\n", word.getWord(),
                            word.getOccurrences(), word.getFrequency()));

Output:

Word : C#       Occurrences : 2  Frequency : 18.18%
Word : Java     Occurrences : 4  Frequency : 36.36%
Word : C++      Occurrences : 1  Frequency : 9.09%
Word : php      Occurrences : 1  Frequency : 9.09%
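
Note that calculateOccurrences rescans the whole list once per distinct word. If that matters for a large input, a single-pass variant can be sketched with Collectors.groupingBy and Collectors.counting. This swaps in a different technique from the WordStatistics class above, and it groups case-sensitively rather than with equalsIgnoreCase (the sample list has no words that differ only in case). The class and method names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class GroupedFrequency {

    // One pass over the list: word -> number of occurrences, in first-seen order.
    static Map<String, Long> countWords(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, LinkedHashMap::new, Collectors.counting()));
    }

    // Percentage of the whole list (one-character words still count toward the total).
    static float frequency(long occurrences, int totalWords) {
        return (float) occurrences / totalWords * 100;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
                "C", "Java", "C#", "C#", "Java", "R");

        countWords(words).entrySet().stream()
                .filter(e -> e.getKey().length() > 1) // skip one-character words in the output
                .forEach(e -> System.out.printf(Locale.ROOT,
                        "Word : %s \tOccurrences : %d \tFrequency : %.2f%%%n",
                        e.getKey(), e.getValue(), frequency(e.getValue(), words.size())));
    }
}
```

The LinkedHashMap keeps the words in the order they first appear; a TreeMap could be passed instead if sorted output is wanted.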