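My program reads in a paragraph of words (stored in a text file). It then needs to do the following:
Restrictions: I cannot use the Collections class, and I cannot store the data more than once (for example, reading the words from the paragraph and storing them in both a Set and an ArrayList).
Coding this won't be hard, but I can't work out the most efficient implementation, since the data could be as large as a few paragraphs from a Wikipedia article. This is my idea so far:
int to record the line numbers. These line numbers would be updated accordingly. It's a bit incomplete, but it's what I have in mind right now. The whole 'Word' class may also be completely unnecessary.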
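Answer 0 (score: 2)
You can use a simple TreeMap<String, Integer> for frequency lookups.
Lookups in a TreeMap take O(log n) comparisons for n distinct words, and each comparison is cheap because the words are short (i.e. what you would find in normal text), so they are fast in practice. (A HashMap would give expected O(1) lookups, at the cost of losing the sorted order.) If you expect a lot of unsuccessful lookups (many searches for words that do not exist), you can pre-filter with a Bloom filter.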
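To make the Bloom filter idea concrete, here is a minimal sketch, assuming a hand-rolled filter built on java.util.BitSet with two hash functions derived from String.hashCode(). The class name WordBloomFilter, the bit count, and the hashing scheme are all illustrative assumptions of mine, not something prescribed above.

```java
import java.util.BitSet;

/** Hypothetical, illustrative Bloom filter for short words. */
class WordBloomFilter {
    private final BitSet bits;
    private final int size;

    WordBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash functions derived from String.hashCode(). This is an
    // assumption made for brevity; a real filter would use better-mixed hashes.
    private int h1(String w) {
        return Math.floorMod(w.hashCode(), size);
    }

    private int h2(String w) {
        int h = w.hashCode();
        return Math.floorMod(h * 31 + (h >>> 16), size);
    }

    void add(String w) {
        bits.set(h1(w));
        bits.set(h2(w));
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String w) {
        return bits.get(h1(w)) && bits.get(h2(w));
    }

    public static void main(String[] args) {
        WordBloomFilter filter = new WordBloomFilter(1 << 16);
        for (String w : "potato orange banana apple".split(" ")) {
            filter.add(w);
        }
        // Words that were added are always reported as possibly present.
        System.out.println(filter.mightContain("apple"));
        // A word that was never added is usually, but not always, rejected.
        System.out.println(filter.mightContain("kiwi"));
    }
}
```

Only when mightContain returns true would you go on to the actual map lookup. For a few paragraphs of text the map alone is probably fast enough, so this only pays off if misses really dominate.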
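I would start with a simple implementation and optimize further only if needed (for example, by parsing the stream directly instead of splitting each line on a separator and iterating over the result again).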
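If you do get to that optimization, parsing the stream directly could look roughly like the sketch below. It assumes that a "word" is any run of letters or digits and that everything else is a separator; the class name StreamWordCounter and that word definition are my own assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.TreeMap;

class StreamWordCounter {
    /** Counts lower-cased words by reading the stream one character at a time. */
    static TreeMap<String, Integer> count(Reader in) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        StringBuilder word = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    word.append(Character.toLowerCase((char) c));
                } else if (word.length() > 0) {
                    // Any non-alphanumeric character ends the current word.
                    freq.merge(word.toString(), 1, Integer::sum);
                    word.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {
            // Flush the last word if the input does not end with a separator.
            freq.merge(word.toString(), 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        Reader in = new StringReader(
                "Potato; orange.\nBanana; apple, apple; potato.\nPotato.");
        System.out.println(count(in));
        // prints {apple=2, banana=1, orange=1, potato=3}
    }
}
```

For a real file you would pass a BufferedReader wrapped around a FileReader, so that the single-character reads are served from a buffer rather than hitting the disk each time.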
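Answer 1 (score: 2)
First, you could create a class that holds the data for the occurrences and line numbers (along with the word). This class can implement the Comparable interface, providing easy comparisons based on word frequency: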
    import java.util.Set;
    import java.util.TreeSet;

    public class WordOccurrence implements Comparable<WordOccurrence> {
        private final String word;
        private int totalCount = 0;
        // a TreeSet keeps the line numbers sorted and free of duplicates
        private final Set<Integer> lineNumbers = new TreeSet<>();

        public WordOccurrence(String word, int firstLineNumber) {
            this.word = word;
            addOccurrence(firstLineNumber);
        }

        public final void addOccurrence(int lineNumber) {
            totalCount++;
            lineNumbers.add(lineNumber);
        }

        // orders by total count, ascending
        @Override
        public int compareTo(WordOccurrence o) {
            return Integer.compare(totalCount, o.totalCount);
        }

        @Override
        public String toString() {
            StringBuilder lineNumberInfo = new StringBuilder("[");
            for (int line : lineNumbers) {
                if (lineNumberInfo.length() > 1) {
                    lineNumberInfo.append(", ");
                }
                lineNumberInfo.append(line);
            }
            lineNumberInfo.append("]");
            return word + ", occurrences: " + totalCount + ", on rows "
                    + lineNumberInfo.toString();
        }
    }
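When reading the words from the file, it is useful to return the data in a Map<String, WordOccurrence>, mapping each word to its WordOccurrence. Using a TreeMap, you get alphabetical ordering "for free". Additionally, you may want to remove punctuation from the lines (e.g. using a regex like \\p{P}) and ignore the case of the words: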
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.TreeMap;

    public TreeMap<String, WordOccurrence> countOccurrences(String filePath)
            throws IOException {
        TreeMap<String, WordOccurrence> words = new TreeMap<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(new File(filePath))))) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                // remove punctuation and normalize to lower-case
                line = line.replaceAll("\\p{P}", "").toLowerCase();
                lineNumber++;
                String[] tokens = line.split("\\s+");
                for (String token : tokens) {
                    if (token.isEmpty()) {
                        continue; // blank lines and leading whitespace yield ""
                    }
                    WordOccurrence occurrence = words.get(token);
                    if (occurrence != null) {
                        occurrence.addOccurrence(lineNumber);
                    } else {
                        words.put(token, new WordOccurrence(token, lineNumber));
                    }
                }
            }
        }
        return words;
    }
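To see what the normalization step does to a single line, and why blank lines need some care, here is a small standalone sketch. The class and method names (TokenizeDemo, tokenize) are made up for illustration.

```java
import java.util.Arrays;

class TokenizeDemo {
    /** Strips punctuation, lower-cases, and splits on whitespace, dropping empty tokens. */
    static String[] tokenize(String line) {
        String cleaned = line.replaceAll("\\p{P}", "").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            // "".split("\\s+") would return one empty token, so handle this case first.
            return new String[0];
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Banana; apple, apple; potato.")));
        // prints [banana, apple, apple, potato]
        System.out.println(tokenize("   ").length);
        // prints 0
        System.out.println("".split("\\s+").length);
        // prints 1, because splitting an empty string yields a single empty string
    }
}
```

Note that \\p{P} also removes hyphens and apostrophes, so "well-known" becomes "wellknown" and "don't" becomes "dont". Whether that is acceptable depends on how you want to define a word.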
Displaying the occurrences in alphabetical order using the code above is as simple as:
    for (Map.Entry<String, WordOccurrence> entry :
            countOccurrences("path/to/file").entrySet()) {
        System.out.println(entry.getValue());
    }
If you cannot use Collections.sort() (and a Comparator<WordOccurrence>) to sort by occurrences, you need to write the sorting yourself. Something like this should do it:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public static void displayInOrderOfOccurrence(
            Map<String, WordOccurrence> words) {
        List<WordOccurrence> orderedByOccurrence = new ArrayList<>();
        // insertion sort, most frequent first; ties keep the map's order
        for (Map.Entry<String, WordOccurrence> entry : words.entrySet()) {
            WordOccurrence wo = entry.getValue();
            // find the first element that occurs less often than wo
            int i = 0;
            while (i < orderedByOccurrence.size()
                    && wo.compareTo(orderedByOccurrence.get(i)) <= 0) {
                i++;
            }
            // insert before it, or append if there is no such element
            orderedByOccurrence.add(i, wo);
        }
        // display
        for (WordOccurrence wo : orderedByOccurrence) {
            System.out.println(wo);
        }
    }
Running the above code with the following test data:
    Potato; orange.
    Banana; apple, apple; potato.
    Potato.
will produce this output:
    apple, occurrences: 2, on rows [2]
    banana, occurrences: 1, on rows [2]
    orange, occurrences: 1, on rows [1]
    potato, occurrences: 3, on rows [1, 2, 3]

    potato, occurrences: 3, on rows [1, 2, 3]
    apple, occurrences: 2, on rows [2]
    banana, occurrences: 1, on rows [2]
    orange, occurrences: 1, on rows [1]
Answer 2 (score: 0)
You could have a structure like this: https://gist.github.com/jeorfevre/946ede55ad93cc811cf8
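    import java.util.HashMap;

    /**
     *
     * @author Jean-Emmanuel je@Rizze.com
     *
     */
    public class WordsIndex {
        final HashMap<String, Word> words = new HashMap<String, Word>();

        // instance method: a static method could not access the words field
        public void put(String word, int line, int paragraph) {
            word = word.toLowerCase();
            if (words.containsKey(word)) {
                // known word: only the count changes, so line and paragraph
                // keep the position of the first occurrence
                Word w = words.get(word);
                w.count++;
            } else {
                // new word
                Word w = new Word();
                w.count = 1;
                w.line = line;
                w.paragraph = paragraph;
                w.word = word;
                words.put(word, w);
            }
        }
    }

    // package-private so that both classes can live in one source file
    class Word {
        String word;
        int count;
        int line;
        int paragraph;
    }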
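Enjoy.
Answer 3 (score: 0)
You can use a TreeMap; it is well suited to keeping data in sorted order. Use the word as the key and its frequency as the value. For example, let the following be your paragraph:
Java is good language Java is object oriented
I would then do the following to store each word along with its frequency: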
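    // requires: import java.util.TreeMap;
    String s = "Java is good language Java is object oriented";
    String[] strArr = s.split(" ");
    TreeMap<String, Integer> tm = new TreeMap<String, Integer>();
    for (String str : strArr) {
        Integer count = tm.get(str);
        if (count == null) {
            // first time we see this word
            tm.put(str, 1);
        } else {
            // write the incremented count back into the map;
            // incrementing a local copy alone would be lost
            tm.put(str, count + 1);
        }
    }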
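On Java 8 or later, the same counting can be written more compactly with Map.merge, which inserts 1 for a new key and otherwise adds 1 to the stored value. A minimal sketch (the class name FrequencyDemo and the helper countWords are just for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

class FrequencyDemo {
    /** Counts how often each space-separated word occurs; keys stay sorted. */
    static TreeMap<String, Integer> countWords(String text) {
        TreeMap<String, Integer> tm = new TreeMap<>();
        for (String str : text.split(" ")) {
            // Stores 1 for an unseen word, otherwise old value + 1.
            tm.merge(str, 1, Integer::sum);
        }
        return tm;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> tm =
                countWords("Java is good language Java is object oriented");
        // A TreeMap iterates in key order; upper-case letters sort before lower-case.
        for (Map.Entry<String, Integer> e : tm.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        // prints:
        // Java = 2
        // good = 1
        // is = 2
        // language = 1
        // object = 1
        // oriented = 1
    }
}
```

If "Java" and "java" should count as the same word, lower-case the text before splitting it.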
Hope this helps.