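I am trying to count the frequency of words in a text file, but I have to take a different approach. For example, if the file contains BRAIN-ISCHEMIA and ISCHEMIA-BRAIN, I need to count BRAIN-ISCHEMIA twice (and leave ISCHEMIA-BRAIN out), or vice versa. Here is my code -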
// Mapping of String->Integer (word -> frequency)
HashMap<String, Integer> frequencyMap = new HashMap<String, Integer>();
// Iterate through each line of the file
String[] temp;
String currentLine;
String currentLine2;
while ((currentLine = in.readLine()) != null) {
// Remove this line if you want words to be case sensitive
currentLine = currentLine.toLowerCase();
temp=currentLine.split("-");
currentLine2=temp[1]+"-"+temp[0];
// Iterate through each word of the current line
// Delimit words based on whitespace, punctuation, and quotes
StringTokenizer parser = new StringTokenizer(currentLine);
while (parser.hasMoreTokens()) {
String currentWord = parser.nextToken();
Integer frequency = frequencyMap.get(currentWord);
// Add the word if it doesn't already exist, otherwise increment the
// frequency counter.
if (frequency == null) {
frequency = 0;
}
frequencyMap.put(currentWord, frequency + 1);
}
StringTokenizer parser2 = new StringTokenizer(currentLine2);
while (parser2.hasMoreTokens()) {
String currentWord2 = parser2.nextToken();
Integer frequency = frequencyMap.get(currentWord2);
// Add the word if it doesn't already exist, otherwise increment the
// frequency counter.
if (frequency == null) {
frequency = 0;
}
frequencyMap.put(currentWord2, frequency + 1);
}
}
// Display our nice little Map
System.out.println(frequencyMap);
But for the following file -
ISCHEMIA-GLUTAMATE
ISCHEMIA-BRAIN
GLUTAMATE-BRAIN
BRAIN-TOLERATE
BRAIN-TOLERATE
TOLERATE-BRAIN
GLUTAMATE-ISCHEMIA
ISCHEMIA-GLUTAMATE
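I get the following output -
{glutamate-brain=1, ischemia-glutamate=3, ischemia-brain=1, glutamate-ischemia=3, brain-tolerate=3, brain-ischemia=1, tolerate-brain=3, brain-glutamate=1}
I think the problem is in the second block. Any light on this problem would be highly appreciated.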
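Answer 0: (score: 2)
From an algorithmic perspective, you may want to consider the following approach:
For each string, split it, sort the parts, and recombine them (i.e. take DEF-ABC and convert it to ABC-DEF; ABC-DEF stays ABC-DEF). Then use the result as the key for your frequency count.
If you need to preserve the exact original item, just include it in your key - so the key would have: ordinal (the recombined string) and the original item.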
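The split/sort/recombine idea above could be sketched with plain JDK collections roughly as follows. This is only an illustration under a few assumptions: the class name `PairFrequency` and the helpers `canonicalKey` and `count` are hypothetical names, the comparison is case sensitive, the first spelling seen is the one reported, and `List.of` needs Java 9 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PairFrequency {

    // Hypothetical helper: split on "-", sort the parts, and join them again,
    // so "DEF-ABC" and "ABC-DEF" both become "ABC-DEF".
    public static String canonicalKey(String item) {
        String[] parts = item.trim().split("-");
        Arrays.sort(parts);
        return String.join("-", parts);
    }

    // Counts items by canonical key. Each count is reported under the first
    // original spelling that was seen for that canonical key.
    public static Map<String, Integer> count(List<String> items) {
        Map<String, String> firstSeen = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String item : items) {
            String original = item.trim();
            if (original.isEmpty()) {
                continue; // skip blank lines
            }
            String key = canonicalKey(original);
            String label = firstSeen.computeIfAbsent(key, k -> original);
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample data from the question, one item per line.
        List<String> lines = List.of(
            "ISCHEMIA-GLUTAMATE", "ISCHEMIA-BRAIN", "GLUTAMATE-BRAIN",
            "BRAIN-TOLERATE", "BRAIN-TOLERATE", "TOLERATE-BRAIN",
            "GLUTAMATE-ISCHEMIA", "ISCHEMIA-GLUTAMATE");
        // prints {ISCHEMIA-GLUTAMATE=3, ISCHEMIA-BRAIN=1, GLUTAMATE-BRAIN=1, BRAIN-TOLERATE=3}
        System.out.println(count(lines));
    }
}
```

Because every line is reduced to one canonical key before counting, each line is counted exactly once, whichever order its two words appear in.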
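Answer 1: (score: 1)
Disclaimer: I stole the sweet trick suggested by Kevin Day for my implementation.
I still wanted to post this just to let you know that using the right data structure (Multiset/Bag) and the right library (google-guava) not only simplifies the code but also makes it efficient.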
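import com.google.common.base.Charsets;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.io.Files;
import com.google.common.io.LineProcessor;
import java.io.File;
import java.io.IOException;

public class BasicFrequencyCalculator
{
public static void main(final String[] args) throws IOException
{
// Typing the LineProcessor removes the need for @SuppressWarnings("unchecked")
Multiset<Word> frequency = Files.readLines(new File("c:/2.txt"), Charsets.ISO_8859_1, new LineProcessor<Multiset<Word>>() {
private final Multiset<Word> result = HashMultiset.create();
@Override
public Multiset<Word> getResult()
{
return result;
}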
@Override
public boolean processLine(final String line) throws IOException
{
result.add(new Word(line));
return true;
}
});
for (Word w : frequency.elementSet())
{
System.out.println(w.getOriginal() + " = " + frequency.count(w));
}
}
}
import java.util.Arrays;

public class Word
{
private final String key;
private final String original;
public Word(final String orig)
{
this.original = orig.trim();
String[] temp = original.toLowerCase().split("-");
Arrays.sort(temp);
key = temp[0] + "-"+temp[1];
}
@Override
public int hashCode()
{
final int prime = 31;
int result = 1;
result = prime * result + ((getKey() == null) ? 0 : getKey().hashCode());
return result;
}
@Override
public boolean equals(final Object obj)
{
if (this == obj)
{
return true;
}
if (obj == null)
{
return false;
}
if (!(obj instanceof Word))
{
return false;
}
Word other = (Word) obj;
if (getKey() == null)
{
if (other.getKey() != null)
{
return false;
}
}
else if (!getKey().equals(other.getKey()))
{
return false;
}
return true;
}
@Override
public String toString()
{
return getOriginal();
}
public String getKey()
{
return key;
}
public String getOriginal()
{
return original;
}
}
Output for the sample file in the question:
BRAIN-TOLERATE = 3
ISCHEMIA-GLUTAMATE = 3
GLUTAMATE-BRAIN = 1
ISCHEMIA-BRAIN = 1
Answer 2: (score: 0)
Thanks, everyone, for your help. This is how I solved it -
// Mapping of String->Integer (word -> frequency)
TreeMap<String, Integer> frequencyMap = new TreeMap<String, Integer>();
// Iterate through each line of the file
String[] temp;
String currentLine;
String currentLine2;
while ((currentLine = in.readLine()) != null) {
temp=currentLine.split("-");
currentLine2=temp[1]+"-"+temp[0];
// Iterate through each word of the current line
StringTokenizer parser = new StringTokenizer(currentLine);
while (parser.hasMoreTokens()) {
String currentWord = parser.nextToken();
Integer frequency = frequencyMap.get(currentWord);
Integer frequency2 = frequencyMap.get(currentLine2);
// Add the word if it doesn't already exist, otherwise increment the
// frequency counter.
if (frequency == null) {
if (frequency2 == null)
frequency = 0;
else {
frequencyMap.put(currentLine2, frequency2 + 1);
break;
}//else
} //if (frequency == null)
frequencyMap.put(currentWord, frequency + 1);
}//while (parser.hasMoreTokens())
}//while ((currentLine = in.readLine()) != null)
// Display our nice little Map
System.out.println(frequencyMap);