I am building a Bayesian filtering system in Java. At the moment my code learns spam and good text from separate .txt files: learn.spam("spam.txt"); and learn.good("good.txt").
The two methods are almost identical:
public void good(String file) throws IOException {
    A2ZFileReader fr = new A2ZFileReader(file);
    String content = fr.getContent();
    String[] tokens = content.split(splitregex);
    int goodTotal = 0;
    for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
            goodTotal++;
            if (words.containsKey(word)) {
                Word w = (Word) words.get(word);
                w.countGood();
            } else {
                Word w = new Word(word);
                w.countGood();
                words.put(word, w);
            }
        }
    }
}
public void spam(String file) throws IOException {
    A2ZFileReader fr = new A2ZFileReader(file);
    String content = fr.getContent();
    String[] tokens = content.split(splitregex);
    int spamTotal = 0; //tokenizer.countTokens();
    for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
            spamTotal++;
            if (words.containsKey(word)) {
                Word w = (Word) words.get(word);
                w.countBad();
            } else {
                Word w = new Word(word);
                w.countBad();
                words.put(word, w);
            }
        }
    }
    Iterator iterator = words.values().iterator();
    while (iterator.hasNext()) {
        Word word = (Word) iterator.next();
        word.calcBadProb(spamTotal);
    }
}
The problem I want to solve now is that, instead of those two .txt files, I have a single file that looks like this:
spam Gamble tonight only for a cheap price of $5 per hand.
ham Sex, I love it. I need it now.
ham yeah I know, I am going tonight that that place, ;) Come join me. You know you want to
ham It is pretty expensive, just this and that for only ($900)
spam Call 123123123 to use for free porn
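There is exactly one message per line. Spam messages start with spam and good messages start with ham, and the label is separated from the message by a tab.
How can I change the methods so that I use only one method and one .txt file to train the filter?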
Answer 0 (score: 1)
Change the good method so that it takes the text itself rather than a file name:
public void good(String content) {
    String[] tokens = content.split(splitregex);
    int goodTotal = 0;
    for (int i = 0; i < tokens.length; i++) {
        String word = tokens[i].toLowerCase();
        Matcher m = wordregex.matcher(word);
        if (m.matches()) {
            goodTotal++;
            if (words.containsKey(word)) {
                Word w = (Word) words.get(word);
                w.countGood();
            } else {
                Word w = new Word(word);
                w.countGood();
                words.put(word, w);
            }
        }
    }
}
Do almost exactly the same thing for spam.
Then write a method train that reads the file, splits it into lines, and calls the correct method depending on the first word of each line.
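A minimal sketch of such a train method follows. It rests on a few assumptions that are not in the question: the file is read with java.nio's Files.readAllLines instead of the asker's A2ZFileReader, the label is split off at the first run of whitespace (which covers a tab), and good and spam are replaced by stand-ins that only record the text they receive. The class name TrainSketch and the helper trainLine are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TrainSketch {
    // Stand-ins that only record what they receive; in the real class these
    // would be the good(String content) and spam(String content) methods.
    final List<String> goodSeen = new ArrayList<>();
    final List<String> spamSeen = new ArrayList<>();

    void good(String content) { goodSeen.add(content); }
    void spam(String content) { spamSeen.add(content); }

    // Reads the whole training file and dispatches it line by line.
    void train(String file) throws IOException {
        for (String line : Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8)) {
            trainLine(line);
        }
    }

    // Splits one line into label and message at the first run of whitespace
    // (assumption: this covers the tab separator described in the question).
    void trainLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            return; // blank line, or a label without a message
        }
        if (parts[0].equals("spam")) {
            spam(parts[1]);
        } else if (parts[0].equals("ham")) {
            good(parts[1]);
        }
        // Lines with any other label are ignored in this sketch.
    }
}
```

Keeping the per-line logic in its own helper makes the dispatch easy to check without touching the file system.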
After that, merging everything into a single method is trivial.
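One way the merged method might look is sketched below. Everything outside the counting loop is an assumption: the question does not show the Word class, splitregex, or wordregex, so the nested Word class and the two patterns here are hypothetical stand-ins, and the name learn is invented. The totals are kept as fields in this sketch because the method now runs once per message rather than once per file, so a local counter would restart on every call.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class MergedLearnSketch {
    // Hypothetical stand-in for the question's Word class.
    static class Word {
        int good;
        int bad;
        void countGood() { good++; }
        void countBad() { bad++; }
    }

    // Assumed patterns; the question does not show splitregex or wordregex.
    static final String SPLIT_REGEX = "\\W+";
    static final Pattern WORD_REGEX = Pattern.compile("[a-z]+");

    final Map<String, Word> words = new HashMap<>();
    int goodTotal = 0; // accumulated over all ham messages
    int spamTotal = 0; // accumulated over all spam messages

    // Counts the words of one message as either spam or good.
    void learn(String content, boolean isSpam) {
        for (String token : content.split(SPLIT_REGEX)) {
            String word = token.toLowerCase(Locale.ROOT);
            if (!WORD_REGEX.matcher(word).matches()) {
                continue; // skip numbers and empty tokens
            }
            Word w = words.computeIfAbsent(word, k -> new Word());
            if (isSpam) {
                w.countBad();
                spamTotal++;
            } else {
                w.countGood();
                goodTotal++;
            }
        }
    }
}
```

With this shape, train would call learn(text, label.equals("spam")) for each line. The probability pass from the original spam method (calcBadProb over all words) would then presumably run once after the whole file has been read, using the accumulated totals.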