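I am trying to run Naive Bayes in R to classify a small number of texts, using library(e1071) and library(tm). When I predict the whole test set (8 texts) at once, I get one set of posterior probabilities; note the values for the first text in particular.
A subset of my training data looks like this: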
# Output of dput(head(sample_data, 8)), assigned to `subset`.
# Category is rebuilt as a factor from the one-label-per-row list in the posted output.
subset <- structure(list(
  Text = c("Kretz, the former CEO of Hanover Corporation, previously pleaded guilty to money laundering, and conspiracy to commit securities fraud, wire fraud and mail fraud.",
           "Tardón was the head of an international narcotics trafficking and money laundering syndicate which distributed over 7,500 kilograms of South American cocaine",
           "Ellison previously pleaded guilty to charges of health care fraud and money laundering",
           "Paris was under attack on Friday by ISIS",
           "A fraud of 20 million Euros have been booked against him for financing terrorist activities",
           "Black money has been a common issue with Indian progress",
           "Corruption charges has been placed against the NGO",
           "An enquiry commission has been put in place regarding the recent uproar "),
  Category = factor(c("Money laundering", "Money laundering", "Money laundering",
                      "Terrorist Financing", "Terrorist Financing",
                      "Bribery and Corruption", "Bribery and Corruption", "Bribery and Corruption"))),
  .Names = c("Text", "Category"), row.names = c(NA, -8L), class = "data.frame")
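Because the training set is very small, I use the following code to prepare my training data for modeling (it stacks the subset on top of itself):
traindata <- as.data.frame(rbind(as.matrix(subset[1:8, c(1, 2)]), as.matrix(subset[1:8, c(1, 2)])))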
testdata <- structure(list(
  Text = c("he is in jail for corruption charges", "he cheat", "he is involved in a racket",
           "this is a violation of the law", "this bank is fraud", "gaming dupe us",
           "he committed fraud", "bank is involved in forgery"),
  Category = c("", "", "", "", "", "", "", "")),
  .Names = c("Text", "Category"), row.names = c(NA, -8L), class = "data.frame")
After preparing the corpus, removing stop words, and creating the training and test matrices, I run the following commands:
model <- naiveBayes(as.matrix(trainmatrix), as.factor(traindata$Category))
results <- predict(model, as.matrix(testmatrix), type = "raw")
These commands give the following results:
[Image: posterior probability values for the 8 texts in the test set]
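The preparation step mentioned above (corpus, stop-word removal, training and test matrices) is not shown in the post. The following is only a minimal sketch of what it might look like with tm, not the code that was actually run. The two short text vectors and the helper `prepare_corpus` are hypothetical stand-ins. The sketch also assumes the test matrix is restricted to the training vocabulary through the `dictionary` control option, so that its columns do not depend on which test texts are passed in.

```r
library(tm)

# Hypothetical stand-ins for traindata$Text and testdata$Text.
train_text <- c("pleaded guilty to money laundering and wire fraud",
                "corruption charges placed against the NGO")
test_text  <- c("he is in jail for corruption charges",
                "he committed fraud")

# Assumed helper (not from the post): lower-case the text, then strip
# punctuation, English stop words, and extra whitespace.
prepare_corpus <- function(text) {
  corpus <- VCorpus(VectorSource(text))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

train_dtm <- DocumentTermMatrix(prepare_corpus(train_text))

# Build the test matrix over the training terms only, so both matrices
# share the same columns regardless of how many test texts there are.
test_dtm <- DocumentTermMatrix(prepare_corpus(test_text),
                               control = list(dictionary = Terms(train_dtm)))

trainmatrix <- as.matrix(train_dtm)
testmatrix  <- as.matrix(test_dtm)
```

If the test matrix were instead built from the test texts alone, its columns would vary with the texts supplied, which is worth checking when results differ between runs.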
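But when I pass only a single text (here, the first of the 8 texts), the posterior probabilities change; the first text now shows different values:
[Image: posterior probability values for the first text from the test set, which differ from the earlier run]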
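To make the comparison concrete, here is a small self-contained sketch of predicting all rows versus only the first row with e1071. The toy term counts, labels, and column names are hypothetical and are not the data from the post. It assumes both calls receive a matrix with the same named columns, which is why the single row is selected with `drop = FALSE`.

```r
library(e1071)

# Hypothetical toy term counts (not the data from the post).
trainmatrix <- matrix(c(2, 0, 1,
                        1, 0, 0,
                        0, 2, 1,
                        0, 1, 2),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(NULL, c("fraud", "attack", "money")))
labels <- factor(c("Money laundering", "Money laundering",
                   "Terrorist Financing", "Terrorist Financing"))
testmatrix <- matrix(c(1, 0, 1,
                       0, 1, 0),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, colnames(trainmatrix)))

model <- naiveBayes(trainmatrix, labels)

# Posteriors for all test rows at once.
all_rows <- predict(model, testmatrix, type = "raw")

# Posterior for the first row only; drop = FALSE keeps a 1-row matrix
# with the same column names instead of collapsing it to a vector.
first_row <- predict(model, testmatrix[1, , drop = FALSE], type = "raw")

# Compare the posterior of the first text under both calls.
same_posterior <- isTRUE(all.equal(as.numeric(all_rows[1, ]),
                                   as.numeric(first_row[1, ])))
print(same_posterior)
```

Comparing `colnames()` of the matrix used in each call is a cheap first check when the two posteriors disagree.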
I can't understand how this happens, since the training data stays the same and nothing in the code has changed. Can anyone help?