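I have to write a program that processes word frequencies using a linked list and outputs the result as: word, number of occurrences, frequency as a percentage.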
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.*;

    public class Link {
        public static void main(String args[]) {
            long start = System.currentTimeMillis();
            LinkedList<String> list = new LinkedList<String>();
            File file = new File("words.txt");
            try {
                Scanner sc = new Scanner(file);
                String words;
                while (sc.hasNext()) {
                    words = sc.next();
                    words = words.replaceAll("[^a-zA-Z0-9]", "");
                    words = words.toLowerCase();
                    words = words.trim();
                    list.add(words);
                }
                sc.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
            Map<String, Integer> frequency = new TreeMap<String, Integer>();
            for (String count : list) {
                if (frequency.containsKey(count)) {
                    frequency.put(count, frequency.get(count) + 1);
                } else {
                    frequency.put(count, 1);
                }
            }
            System.out.println(frequency);
            long end = System.currentTimeMillis();
            System.out.println("\n" + "Duration: " + (end - start) + " ms");
        }
    }
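Output: {a=1, ab=3, abbc=1, asd=2, xyz=1}
What I don't know is how to express the frequency as a percentage and how to ignore words shorter than 2 characters. For example, "a=1" should be ignored.
Thanks in advance.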
Answer 0 (score: 4)
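First, introduce a double variable to keep track of the total number of occurrences (a double, so that the later division is not integer division). E.g.
    double total = 0;
Next, filter out any String with length() < 2. You can do this before adding it to the LinkedList.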
    while (sc.hasNext()) {
        words = sc.next();
        words = words.replaceAll("[^a-zA-Z0-9]", "");
        words = words.toLowerCase();
        words = words.trim();
        if (words.length() >= 2) list.add(words); // Filter out strings < 2 chars
    }
Now, while going through the Strings, we should increase the total variable by 1 for each one, since each is an occurrence:
    for (String count : list) {
        if (frequency.containsKey(count)) {
            frequency.put(count, frequency.get(count) + 1);
        } else {
            frequency.put(count, 1);
        }
        total++; // Increase the total number of occurrences
    }
Then we can print it out using System.out.printf().
    for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
        System.out.printf("String: %s \t Occurrences: %d \t Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue() / total * 100);
    }
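Putting the steps above together, here is a minimal self-contained sketch. The class name FrequencyDemo, the helper names, and the in-memory word list are assumptions for illustration (the list is made up so that it reproduces the counts shown in the question, and it replaces reading words.txt); it is not the asker's actual input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class FrequencyDemo {
    // Counts only words of at least two characters; TreeMap keeps the keys sorted.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String w : words) {
            if (w.length() >= 2) {
                Integer old = frequency.get(w);
                frequency.put(w, old == null ? 1 : old + 1);
            }
        }
        return frequency;
    }

    // Share of one word among all counted words, as a percentage.
    // Multiplying by 100.0 first forces floating-point division.
    static double percentage(int occurrences, int total) {
        return occurrences * 100.0 / total;
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the contents of words.txt.
        List<String> words = Arrays.asList("a", "ab", "asd", "ab", "xyz", "abbc", "asd", "ab");
        Map<String, Integer> frequency = countWords(words);

        int total = 0;
        for (int c : frequency.values()) {
            total += c;
        }
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            System.out.printf(Locale.ROOT, "%s, %d, %.2f%%%n",
                    e.getKey(), e.getValue(), percentage(e.getValue(), total));
        }
    }
}
```

With this input the single-character word "a" is dropped, so the percentages are taken over the 7 remaining words (for example, "ab" is 3 of 7).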
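Note that this (the printf statement) will not look good once you are dealing with long Strings or large numbers of occurrences. So you could optionally do the following instead, given that maxLength contains the largest length() of any String in your list and occLength contains the number of digits of the largest occurrence count.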
    for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
        System.out.printf("String: %" + maxLength + "s Occurrences: %" + occLength + "d Percentage: %.2f%%%n", entry.getKey(), entry.getValue(), entry.getValue() / total * 100);
    }
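The answer leaves open how maxLength and occLength are obtained. One possible way, sketched below, is to derive both from the frequency map before printing. The class ColumnWidths and its method names are hypothetical helpers, not part of the answer.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ColumnWidths {
    // Length of the longest key; at least 1, because "%0s" is not a valid format.
    static int maxKeyLength(Map<String, Integer> frequency) {
        int max = 1;
        for (String key : frequency.keySet()) {
            max = Math.max(max, key.length());
        }
        return max;
    }

    // Number of decimal digits in the largest count.
    static int maxCountDigits(Map<String, Integer> frequency) {
        int max = 0;
        for (int c : frequency.values()) {
            max = Math.max(max, c);
        }
        return String.valueOf(max).length();
    }

    // Builds one right-aligned row using the computed column widths.
    static String formatRow(String word, int count, double total, int maxLength, int occLength) {
        String fmt = "String: %" + maxLength + "s Occurrences: %" + occLength + "d Percentage: %.2f%%";
        return String.format(Locale.ROOT, fmt, word, count, count / total * 100);
    }

    public static void main(String[] args) {
        // Hypothetical counts, only to exercise the formatting.
        Map<String, Integer> frequency = new TreeMap<>();
        frequency.put("ab", 3);
        frequency.put("abbc", 12);

        int maxLength = maxKeyLength(frequency);
        int occLength = maxCountDigits(frequency);
        double total = 15;
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            System.out.println(formatRow(e.getKey(), e.getValue(), total, maxLength, occLength));
        }
    }
}
```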
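Answer 1 (score: 1)
Skip strings shorter than 2 characters at the step where you add to the map, and maintain a counter of legal words to compute the percentage against.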
    int legalWords = 0;
    for (String count : list) {
        // String has length(), not size()
        if (count.length() >= 2) {
            if (frequency.containsKey(count)) {
                frequency.put(count, frequency.get(count) + 1);
            } else {
                frequency.put(count, 1);
            }
            legalWords++;
        }
    }
    // The map is declared as Map<String, Integer> frequency in the question
    for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
        System.out.println(entry.getKey() + " " + entry.getValue() + " " + (entry.getValue() / (double) legalWords) * 100.0 + "%");
    }
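As a side note, on Java 8 or later the containsKey/put pair can be collapsed with Map.merge. This is a swapped-in idiom rather than what the answer uses, and the class name MergeCount and the sample list are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class MergeCount {
    // Counts words of at least two characters in one pass.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> frequency = new TreeMap<>();
        for (String w : words) {
            if (w.length() >= 2) {
                // Inserts 1 for a new key, otherwise adds 1 to the existing count.
                frequency.merge(w, 1, Integer::sum);
            }
        }
        return frequency;
    }

    public static void main(String[] args) {
        // Hypothetical input.
        List<String> words = Arrays.asList("a", "ab", "ab", "xyz");
        Map<String, Integer> frequency = count(words);

        int legalWords = 0;
        for (int c : frequency.values()) {
            legalWords += c;
        }
        for (Map.Entry<String, Integer> e : frequency.entrySet()) {
            // Cast to double: int / int would truncate every ratio below 1 to 0.
            double pct = e.getValue() / (double) legalWords * 100.0;
            System.out.printf(Locale.ROOT, "%s %d %.2f%%%n", e.getKey(), e.getValue(), pct);
        }
    }
}
```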
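Answer 2 (score: 0)
Note: since the OP's question does not give us the details, we assume that one-character words are counted toward the total but are not output.
Separate your logic from your main class: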
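    class WordStatistics {
        private String word;
        private long occurrences;
        private float frequency;

        public WordStatistics(String word) {
            this.word = word;
        }

        // Counts how often this word appears in the list, ignoring case
        public WordStatistics calculateOccurrences(List<String> words) {
            this.occurrences = words.stream()
                    .filter(p -> p.equalsIgnoreCase(this.word)).count();
            return this;
        }

        // Must be called after calculateOccurrences
        public WordStatistics calculateFrequency(List<String> words) {
            this.frequency = (float) this.occurrences / words.size() * 100;
            return this;
        }

        // getters (setters omitted)
        public String getWord() { return word; }
        public long getOccurrences() { return occurrences; }
        public float getFrequency() { return frequency; }
    }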
Explanation
Consider this list of words:
    List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
            "C", "Java", "C#", "C#", "Java", "R");
Use the Java 8 Streams API to count the occurrences of word in words:
    words.stream()
            .filter(p -> p.equalsIgnoreCase(word)).count();
Compute the word's frequency as a percentage:
    frequency = (float) occurrences / words.size() * 100;
Set up your words' statistics (occurrences + frequency):
    List<WordStatistics> wordsStatistics = new LinkedList<WordStatistics>();
    words.stream()
            .distinct()
            .forEach(
                    word -> wordsStatistics.add(new WordStatistics(word)
                            .calculateOccurrences(words)
                            .calculateFrequency(words)));
Output the results, skipping one-character words:
    wordsStatistics
            .stream()
            .filter(word -> word.getWord().length() > 1)
            .forEach(
                    word -> System.out.printf("Word : %s \t"
                            + "Occurences : %d \t"
                            + "Frequency : %.2f%% \t\n", word.getWord(),
                            word.getOccurrences(), word.getFrequency()));
Output:
    Word : C# Occurences : 2 Frequency : 18.18%
    Word : Java Occurences : 4 Frequency : 36.36%
    Word : C++ Occurences : 1 Frequency : 9.09%
    Word : php Occurences : 1 Frequency : 9.09%
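The approach above rescans the whole list once per distinct word. If that matters, a single-pass alternative is to group and count with Collectors.groupingBy. The sketch below is a different technique from the one in this answer, and the class name GroupingCount is made up. It lowercases every word first, an assumption made so that the map keys are consistently case-insensitive (the answer compares with equalsIgnoreCase but deduplicates with a case-sensitive distinct()).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class GroupingCount {
    // One pass over the list: lowercase each word, then count per distinct key.
    static Map<String, Long> count(List<String> words) {
        return words.stream()
                .map(w -> w.toLowerCase(Locale.ROOT))
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Java", "C++", "R", "php", "Java",
                "C", "Java", "C#", "C#", "Java", "R");
        Map<String, Long> counts = count(words);

        // As in the answer, one-character words count toward the total
        // but are not printed.
        int total = words.size();
        counts.forEach((word, n) -> {
            if (word.length() > 1) {
                System.out.printf(Locale.ROOT, "Word : %s \tOccurrences : %d \tFrequency : %.2f%%%n",
                        word, n, n * 100.0 / total);
            }
        });
    }
}
```

Because the result is a TreeMap, the rows come out sorted by word rather than in the order the words first appear.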