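I am implementing the Naive Bayes algorithm for text categorization. I have about 1000 training documents and 400 testing documents. I think I have implemented the training part correctly, but I am confused about the testing part. Here is briefly what I have done:
In my training function: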
vocabularySize = GetUniqueTermsInCollection(); // number of unique terms in the entire collection
spamModelArray[vocabularySize];
nonspamModelArray[vocabularySize];
for each training_file {
    class = GetClassLabel(); // 0 = spam, 1 = non-spam
    document = GetDocumentID();
    counterTotalTrainingDocs++;
    if (class == 0) {
        counterTotalSpamTrainingDocs++;
    }
    for each term in document {
        freq = GetTermFrequency(); // how many times this term appears in this document
        id = GetTermID(); // unique id of the term
        if (class == 0) { // SPAM
            spamModelArray[id] += freq;
            totalNumberofSpamWords += freq; // total number of term occurrences in the spam training docs
        } else { // NON-SPAM
            nonspamModelArray[id] += freq;
            totalNumberofNonSpamWords += freq; // total number of term occurrences in the non-spam training docs
        }
    } // for each term
} // for each training_file
// normalize only after all training files have been counted
for i in 0 .. vocabularySize - 1 {
    spamModelArray[i] = spamModelArray[i] / totalNumberofSpamWords;
    nonspamModelArray[i] = nonspamModelArray[i] / totalNumberofNonSpamWords;
} // for
priorProb = counterTotalSpamTrainingDocs / counterTotalTrainingDocs; // prior probability of the spam documents
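For reference, the training steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions that are not in the post: documents are given as lists of tokens, the function name `train_naive_bayes` is hypothetical, and add-one (Laplace) smoothing is included so that no term probability is zero, which the pseudocode above does not do.

```python
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: 0 = spam, 1 = non-spam (as in the post)."""
    # Vocabulary of all unique terms in the training collection.
    vocab = {term for doc in docs for term in doc}
    # Per-class term counts (the role of spamModelArray / nonspamModelArray).
    counts = {0: Counter(), 1: Counter()}
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    model = {}
    for label in (0, 1):
        total = sum(counts[label].values())  # total term occurrences in this class
        # Add-one smoothing: an assumption of this sketch, not part of the post.
        model[label] = {
            term: (counts[label][term] + 1) / (total + len(vocab))
            for term in vocab
        }
    # Prior probability of spam = spam docs / all docs.
    prior_spam = labels.count(0) / len(labels)
    return model, prior_spam

# Tiny made-up corpus, purely for illustration.
docs = [["buy", "cheap", "pills"], ["meeting", "at", "noon"], ["cheap", "cheap", "offer"]]
labels = [0, 1, 0]
model, prior_spam = train_naive_bayes(docs, labels)
```

With smoothing, every entry of `model[0]` and `model[1]` is strictly positive, so taking `log` of it in the testing step cannot fail on a term that was never seen in one of the classes.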
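I think I understand and have implemented the training part correctly, but I am not sure whether I have implemented the testing part properly. Here, I go through each testing document and calculate logP(spam|d) and logP(non-spam|d) for each document. Then I compare these two quantities to determine the class (spam/non-spam).
In my testing function: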
vocabularySize = GetUniqueTermsInCollection(); // number of unique terms in the entire collection
for each testing_file {
    document = GetDocumentID();
    logProbabilityofSpam = 0;
    logProbabilityofNonSpam = 0;
    for each term in document {
        freq = GetTermFrequency(); // how many times this term appears in this document
        id = GetTermID(); // unique id of the term
        // logP(w1 w2 ... wn | c) = sum over j of C(wj) * logP(wj | c)
        logProbabilityofSpam += freq * log(spamModelArray[id]);
        logProbabilityofNonSpam += freq * log(nonspamModelArray[id]);
    } // for each term
    // Now I am deciding whether this document is spam
    if (logProbabilityofNonSpam + log(1 - priorProb) > logProbabilityofSpam + log(priorProb)) { // argmax over ck of [logP(d|ck) + logP(ck)]
        newclass = 1; // non-spam
    } else {
        newclass = 0; // spam
    }
} // for each testing_file
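My problem is that I want to return the probability for each class instead of exact 1s and 0s (spam/non-spam). I want to see, for example, newclass = 0.8684212, so that I can apply a threshold later on. But I am confused here. How can I calculate the probability for each document? Can I use the logProbabilities to calculate it?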
Answer 0 (score: 3)
According to the naive Bayes probability model, the probability that data described by a set of features {F1, F2, ..., Fn} belongs to class C is
P(C|F) = P(C) * (P(F1|C) * P(F2|C) * ... * P(Fn|C)) / P(F1, ..., Fn)
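You have all the terms (in logarithmic form) except for the 1 / P(F1, ..., Fn) term, since that term is not used in the naive Bayes classifier you are implementing. (Strictly speaking, the MAP classifier.)
You would have to collect the frequencies of the features as well, and from them calculate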
P(F1, ..., Fn) = P(F1) * ... * P(Fn)
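Alternatively, because there are only two classes and P(spam|d) + P(non-spam|d) = 1, the denominator can be obtained by summing the two joint scores instead of being estimated separately. Note that this is a different normalization from the product of feature marginals above. A minimal sketch, assuming `log_spam` and `log_nonspam` (hypothetical names) hold logProbabilityofSpam + log(priorProb) and logProbabilityofNonSpam + log(1 - priorProb) from the question:

```python
import math

def spam_posterior(log_spam, log_nonspam):
    """Return P(spam | d) from the two unnormalized log joint scores."""
    # Subtract the larger score before exponentiating so that exp() does not
    # underflow to 0 for both classes (log scores of long documents are very negative).
    m = max(log_spam, log_nonspam)
    e_spam = math.exp(log_spam - m)
    e_nonspam = math.exp(log_nonspam - m)
    return e_spam / (e_spam + e_nonspam)

# Made-up log scores, purely for illustration.
p = spam_posterior(-1050.0, -1052.0)
```

The returned value lies between 0 and 1 and can be compared against any threshold; using 0.5 gives the same decision as comparing the two log scores directly.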