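I have built a naive Bayes classifier on a dataset of text messages, using the bnlearn package because I want a multinomial naive Bayes model.
My training set looks like the sample below. I have to classify each message into a specific CLASS.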
message | CLASS
Worth reading mums;;;hope we too could | 1
Musical bonding classes for a 9 month old- Yay or Nay? Should we start or wait for a few more months? | 2
Girls...what plans for valentine...?. | 3
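library(tm)       # Corpus, tm_map, DocumentTermMatrix
library(bnlearn)  # naive.bayes, predict

# Read the training messages and their labels
dataset <- read.csv("Traindataset.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Build the corpus and clean the text
df <- Corpus(VectorSource(dataset$message))
df1 <- tm_map(df, stripWhitespace)
df1 <- tm_map(df1, content_transformer(tolower))
df1 <- tm_map(df1, removePunctuation)
df1 <- tm_map(df1, removeNumbers)
df1 <- tm_map(df1, removeWords, stopwords("english"))
# One column per term, plus the class label, all converted to factors
dtm <- DocumentTermMatrix(df1)
dtm1 <- as.matrix(dtm)
dtm1 <- as.data.frame(cbind(dtm1, CLASS = dataset$CLASS))
dtm1 <- as.data.frame(lapply(dtm1, as.factor))
# Fit the naive Bayes network and predict on the training data itself
bn <- naive.bayes(dtm1, "CLASS")
pred <- predict(bn, dtm1)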
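When I predict on the same data, it works fine without throwing any errors. The problem is that when I test the model bn on the unseen data tst, I get an error saying that the network and the data have different numbers of variables. Any help is appreciated.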
# Read the unseen test messages
tst <- read.csv("TestDataset.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Same cleaning steps as for the training corpus
df <- Corpus(VectorSource(tst$message))
df1 <- tm_map(df, stripWhitespace)
df1 <- tm_map(df1, content_transformer(tolower))
df1 <- tm_map(df1, removePunctuation)
df1 <- tm_map(df1, removeNumbers)
df1 <- tm_map(df1, removeWords, stopwords("english"))
# The term columns here come from the test messages only
dtmtest <- DocumentTermMatrix(df1)
dtmtest1 <- as.matrix(dtmtest)
dtmtest1 <- as.data.frame(cbind(dtmtest1, CLASS = tst$CLASS))
dtmtest1 <- as.data.frame(lapply(dtmtest1, as.factor))
> pred = predict(bn, dtmtest1)
Error in check.bn.vs.data(x, data) :
the network and the data have different numbers of variables.
Edit:
> names(bn$tables) %in% names(dtmtest1)
logical(0)
> s <- names(bn$nodes) %in% names(dtmtest1)
> length(s)
[1] 6077
> sum(names(bn$nodes) %in% names(dtmtest1))
[1] 6057
> length(bn$nodes)
[1] 6077
> length(names(dtmtest1))
[1] 12509
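The counts above say that 6057 of the 6077 network nodes appear among the test columns, and that the test data has 12509 columns in total. To see which names differ rather than only how many, a set difference in each direction may help. This is a minimal sketch on made-up names: `train_vars` and `test_vars` are hypothetical stand-ins for `names(bn$nodes)` and `names(dtmtest1)`.

```r
# Toy stand-ins for names(bn$nodes) and names(dtmtest1); the terms are made up.
train_vars <- c("classes", "girls", "mums", "plans", "CLASS")
test_vars  <- c("classes", "girls", "baby", "sleep", "CLASS")

# Variables the network expects but the test data does not contain
missing_in_test <- setdiff(train_vars, test_vars)

# Variables that exist only in the test data, unknown to the network
extra_in_test <- setdiff(test_vars, train_vars)

print(missing_in_test)
print(extra_in_test)
```

In the real session the same two calls would be `setdiff(names(bn$nodes), names(dtmtest1))` and `setdiff(names(dtmtest1), names(bn$nodes))`.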
> dtmtest1
> dtmtest
A document-term matrix (2309 documents, 12508 terms)
Non-/sparse entries: 51826/28829146
Sparsity : 100%
Maximal term length: 123
Weighting : term frequency (tf)
> dtm
A document-term matrix (872 documents, 6076 terms)
Non-/sparse entries: 17041/5281231
Sparsity : 100%
Maximal term length: 123
Weighting : term frequency (tf)
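The two summaries show different vocabularies (6076 training terms versus 12508 test terms), which would explain the mismatch in the number of variables. One possible way around it, sketched here in base R on toy data, is to rebuild the test columns from the training vocabulary: keep shared terms, fill terms missing from the test set with zero counts, drop terms the network never saw, and reuse the training factor levels. The helper `align_to_training` and the toy objects are hypothetical and not part of tm or bnlearn. The sketch also assumes the column names of the count matrix match the names in the training data frame exactly.

```r
# Hypothetical helper: give the test counts exactly the training term columns,
# in the training order, with the training factor levels.
align_to_training <- function(test_counts, train_df, class_col = "CLASS") {
  term_cols <- setdiff(names(train_df), class_col)
  # Start from an all-zero matrix whose columns are the training vocabulary
  aligned <- matrix(0, nrow = nrow(test_counts), ncol = length(term_cols),
                    dimnames = list(NULL, term_cols))
  # Copy the counts of terms that occur in both vocabularies
  shared <- intersect(term_cols, colnames(test_counts))
  aligned[, shared] <- test_counts[, shared, drop = FALSE]
  aligned <- as.data.frame(aligned)
  # Reuse the training levels; counts never seen in training become NA
  for (term in term_cols) {
    aligned[[term]] <- factor(aligned[[term]], levels = levels(train_df[[term]]))
  }
  aligned
}

# Toy training frame in the same shape as dtm1 (made-up terms and counts)
train_df <- data.frame(
  girls = factor(c(0, 1, 0)),
  mums  = factor(c(1, 0, 0)),
  plans = factor(c(0, 1, 2)),
  CLASS = factor(c(1, 3, 2))
)

# Toy test counts in the same shape as as.matrix(dtmtest): "mums" is absent,
# "baby" is a term the training data never contained
test_counts <- cbind(girls = c(1, 0), plans = c(0, 3), baby = c(2, 0))

test_aligned <- align_to_training(test_counts, train_df)
print(names(test_aligned))
print(test_aligned)
```

With tm itself, a similar effect on the columns may be achievable by restricting the test matrix to the training terms when it is built, for example `DocumentTermMatrix(df1, control = list(dictionary = Terms(dtm)))`. The factor levels would still need to match the training data, and whether `predict` also expects a CLASS column in the new data is worth checking against the bnlearn documentation.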