Question

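I want to find the frequency of multi-token strings or phrases in a document. It is not the word/single-term frequency that I am looking for; it is always multiple terms, and the number of terms is dynamic...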

For example: search for the frequency of "words with friends" in a document!

Any help or pointers are much appreciated.

Thanks,
Debjani

Answer 1

You can read the document line by line with a BufferedReader, then use the split function to get the frequency of the word/token:

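int count = 0;
String strLine;
// br is a BufferedReader already opened on the document
while ((strLine = br.readLine()) != null) {
    // limit -1 keeps trailing empty strings, so a match at the end of a line is still counted
    count += strLine.split("words with friends", -1).length - 1;
}
return count;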

Edit: if you want to perform a case-insensitive search, you can use a Pattern:

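// requires: import java.util.regex.Pattern;
Pattern myPattern = Pattern.compile("words with friends", Pattern.CASE_INSENSITIVE);
int count = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    // limit -1 keeps trailing empty strings, so a match at the end of a line is still counted
    count += myPattern.split(strLine, -1).length - 1;
}
return count;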
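
For reference, here is a self-contained sketch of the same line-by-line split idea. The class name LineSplitCounter and the method countPhrase are hypothetical, made up for illustration. It assumes the phrase should be matched literally (hence Pattern.quote) and that a phrase never spans a line break, since each line is examined on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class LineSplitCounter {
    // Counts case-insensitive occurrences of a literal phrase, one line at a time.
    static int countPhrase(Reader source, String phrase) {
        // Pattern.quote makes regex metacharacters in the phrase match literally.
        Pattern pattern = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE);
        int count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                // n matches split a line into n + 1 pieces when limit is -1.
                count += pattern.split(line, -1).length - 1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

It could be called with a FileReader for a real document, or with a StringReader for a quick check.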

Answer 2

Why not use a regular expression? Regular expressions are well suited to this kind of task. See the Matcher class:

http://download.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html
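
A minimal sketch of what the Matcher approach might look like, assuming the whole document fits in memory as one String. The class PhraseFrequency and its methods are hypothetical names. The phrase is built from a dynamic list of tokens, each quoted so it matches literally, and joined with \s+ so that the tokens may be separated by any run of whitespace, including a line break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class PhraseFrequency {
    // Builds a case-insensitive pattern from any number of tokens.
    static Pattern phrasePattern(List<String> tokens) {
        List<String> quoted = new ArrayList<>();
        for (String token : tokens) {
            quoted.add(Pattern.quote(token)); // match each token literally
        }
        // Tokens may be separated by one or more whitespace characters.
        return Pattern.compile(String.join("\\s+", quoted), Pattern.CASE_INSENSITIVE);
    }

    // Counts non-overlapping occurrences of the phrase in the text.
    static int count(String text, List<String> tokens) {
        Matcher matcher = phrasePattern(tokens).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

Like the split-based answer, this also counts a phrase that sits inside longer words; wrapping the pattern in \b word boundaries would be one way to restrict it to whole words.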

How to find the frequency of a phrase (multi-token string) within a document in Java?
