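I am new to machine learning. I am classifying tweets into three classes (information, neutral, and metaphor) using term-frequency features. My data is split 70-30% into training and test sets for every class. However, accuracy differs widely across the three classes: the first class is high at 92%, the second is moderate at 53%, and the third is very low at 15%. Can anyone tell me what might be wrong with my algorithm?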
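To make the split concrete, here is a minimal sketch of a 70/30 split done separately within each class. The data frame `tweets`, its columns, and the seed are made up for illustration; this is not my actual data or loading code.

```r
set.seed(42)  # arbitrary seed so the illustration is reproducible

# Hypothetical data: 10 tweets per class
tweets <- data.frame(
  text  = paste("tweet", 1:30),
  label = rep(c("information", "neutral", "metaphor"), each = 10),
  stringsAsFactors = FALSE
)

# Sample 70% of the row indices within each class
trainIdx <- unlist(lapply(
  split(seq_len(nrow(tweets)), tweets$label),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

train <- tweets[trainIdx, ]
test  <- tweets[-trainIdx, ]

print(table(train$label))  # 7 tweets per class
print(table(test$label))   # 3 tweets per class
```

Splitting within each class keeps the 70-30 ratio the same for all three classes, although it does not make the classes themselves equal in size.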
`# Create document-term matrices (transposed term-document matrices),
# keeping only terms of at least 4 characters
tweets.information.matrix <- t(TermDocumentMatrix(tweets.information.corpus, control = list(wordLengths = c(4, Inf))))
tweets.metaphor.matrix <- t(TermDocumentMatrix(tweets.metaphor.corpus, control = list(wordLengths = c(4, Inf))))
tweets.neutral.matrix <- t(TermDocumentMatrix(tweets.neutral.corpus, control = list(wordLengths = c(4, Inf))))
tweets.test.matrix <- t(TermDocumentMatrix(tweets.test.corpus, control = list(wordLengths = c(4, Inf))))`
This is how the probabilities are calculated:
`probabilityMatrix <- function(docMatrix)
{
  # Sum up the term frequencies (convert to a dense matrix once)
  m <- as.matrix(docMatrix)
  termSums <- cbind(colnames(m), as.numeric(colSums(m)))
  # Add one (add-one / Laplace smoothing)
  termSums <- cbind(termSums, as.numeric(termSums[, 2]) + 1)
  # Calculate the probabilities
  termSums <- cbind(termSums, as.numeric(termSums[, 3]) / sum(as.numeric(termSums[, 3])))
  # Calculate the natural log of the probabilities
  termSums <- cbind(termSums, log(as.numeric(termSums[, 4])))
  # Add pretty names to the columns
  colnames(termSums) <- c("term", "count", "additive", "probability", "lnProbability")
  # Note: because of the term names, cbind() makes this a character matrix
  termSums
}
`
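As a small worked illustration of the same add-one calculation, here is a sketch in base R on a made-up toy matrix. The matrix `toy` and its terms are hypothetical and only meant to show the arithmetic.

```r
# Hypothetical toy document-term matrix: 2 tweets (rows) x 3 terms (columns)
toy <- matrix(c(2, 0, 1,
                1, 0, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("flood", "river", "storm")))

counts   <- colSums(toy)              # flood = 3, river = 0, storm = 1
additive <- counts + 1                # add-one smoothing: 4, 1, 2
prob     <- additive / sum(additive)  # 4/7, 1/7, 2/7
lnProb   <- log(prob)                 # natural log of each probability

print(round(prob, 3))
```

The term "river" never occurs in the toy tweets, yet it still gets a non-zero probability of 1/7 because of the added one.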
The results are highly accurate for information, moderately accurate for neutral, and very low for metaphor.
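By per-class accuracy I mean the fraction of test tweets of each class that get the correct label. A minimal sketch of that calculation with a confusion matrix follows; the label vectors are made up for illustration and are not my real predictions.

```r
classes <- c("information", "neutral", "metaphor")

# Hypothetical true and predicted labels for six test tweets
actual <- factor(c("information", "information", "neutral",
                   "neutral", "metaphor", "metaphor"),
                 levels = classes)
predicted <- factor(c("information", "information", "neutral",
                      "information", "information", "metaphor"),
                    levels = classes)

# Rows are the true class, columns are the predicted class
confusion <- table(actual, predicted)

# Correct predictions per class divided by the number of tweets in that class
perClassAccuracy <- diag(confusion) / rowSums(confusion)

print(confusion)
print(perClassAccuracy)
```

Looking at the off-diagonal cells of the confusion matrix shows which class the misclassified tweets are being assigned to, which may help narrow down the problem.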