Question

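I am new to machine learning. I am classifying tweets into three classes using term-frequency features. My training and test data cover all classes, with a 70-30% split for each class. However, the accuracy differs greatly between the three classes: the first class is high at 92%, the second is moderate at 53%, and the third is very low at 15%. Can anyone tell me what might be wrong with my algorithm? My three classes are information, neutral, and metaphor.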

`# Create term document matrix
tweets.information.matrix <- t(TermDocumentMatrix(tweets.information.corpus,control = list(wordLengths=c(4,Inf))));
tweets.metaphor.matrix <- t(TermDocumentMatrix(tweets.metaphor.corpus,control = list(wordLengths=c(4,Inf))));
tweets.neutral.matrix <- t(TermDocumentMatrix(tweets.neutral.corpus,control = list(wordLengths=c(4,Inf))));
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus,control = list(wordLengths=c(4,Inf))));`

This is how the probabilities are calculated:

`probabilityMatrix <- function(docMatrix)
{
  # Convert to a dense matrix once and sum up the term frequencies.
  # cbind() with the term names coerces every column to character,
  # hence the as.numeric() calls below.
  m <- as.matrix(docMatrix)
  termSums <- cbind(colnames(m), as.numeric(colSums(m)))
  # Add one (additive smoothing)
  termSums <- cbind(termSums, as.numeric(termSums[,2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, as.numeric(termSums[,3]) / sum(as.numeric(termSums[,3])))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[,4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  termSums
}
`
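
To make the add-one step concrete, here is a minimal sketch of the same calculation on a toy count matrix. The matrix `toyCounts` and its term names are made up for illustration, and the result is kept numeric in a data frame rather than in the character matrix that `probabilityMatrix` returns, so this is an assumption-laden variant and not the exact function above.

```r
# Toy document-term counts (hypothetical data): 3 documents x 4 terms
toyCounts <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    0, 3, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("data", "like", "news", "moon"))
)

# Sum the term frequencies over all documents of the class
termCounts <- colSums(toyCounts)

# Add one to every count so unseen terms (here "moon") get a non-zero probability
smoothed <- termCounts + 1

# Normalise to probabilities and take the natural log
toyProbs <- data.frame(
  term        = names(termCounts),
  count       = as.numeric(termCounts),
  additive    = as.numeric(smoothed),
  probability = as.numeric(smoothed / sum(smoothed)),
  stringsAsFactors = FALSE
)
toyProbs$lnProbability <- log(toyProbs$probability)

print(toyProbs)
```

The smoothed probabilities sum to 1 within each class, and a term with a zero count still ends up with a finite log probability.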

The results are highly accurate for information, moderately accurate for neutral, and very low for metaphor.
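
For reference, per-class figures like these can be read off a confusion matrix. The sketch below uses base R only; the `actual` and `predicted` label vectors are hypothetical placeholders, not my real results, and "accuracy" here is assumed to mean the fraction of each true class that was predicted correctly.

```r
# Hypothetical true and predicted labels for a small test set
classes <- c("information", "neutral", "metaphor")
actual <- factor(
  c("information", "information", "information",
    "neutral", "neutral",
    "metaphor", "metaphor", "metaphor"),
  levels = classes
)
predicted <- factor(
  c("information", "information", "information",
    "information", "neutral",
    "information", "neutral", "metaphor"),
  levels = classes
)

# Rows are the true classes, columns are the predicted classes
confusion <- table(actual = actual, predicted = predicted)

# Fraction of each true class that was predicted correctly
perClassAccuracy <- diag(confusion) / rowSums(confusion)

print(confusion)
print(perClassAccuracy)
```

The off-diagonal cells show which class the misclassified tweets were assigned to, which helps tell whether one class is absorbing most of the errors.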

Different accuracy for the three classes in Naive Bayes

0 answers: