R: naiveBayes "subscript out of bounds" error

Time: 2017-02-03 13:17:41

Tags: r classification naivebayes categorization

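I am trying to classify 94 speech texts. Because naiveBayes does not work well when a category in the training set is missing from the test set, I randomized the split and checked: the categories are fine. Even so, the classifier does not work on the test set. Here is the code, followed by the error message: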

Df.dtm<-cbind(Df.dtm, category)
dim(Df.dtm)
Df.dtm[1:10, 530:532]

# Randomize and Split data by rownumber
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))
test <- (1:nrow(Df.dtm))[- train]

# Isolate classifier
cl <- Df.dtm[, "category"]
summary(cl[train])
#  dip  eds  ind pols 
#   23    8    3   13 

# Create model data and remove "category"
modeldata <- Df.dtm[,!colnames(Df.dtm) %in% "category"]

#Boolean feature Multinomial Naive Bayes
#Function to convert the word frequencies to yes and no labels
convert_count <- function(x) {
  y <- ifelse(x > 0, 1,0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}

#Apply the convert_count function to get final training and testing DTMs
train.cc <- apply(modeldata[train, ], 2, convert_count)
test.cc <- apply(modeldata[test, ], 2, convert_count)

#Training the Naive Bayes Model
#Train the classifier
system.time(classifier <- naiveBayes(train.cc, cl[train], laplace = 1) )

Training the classifier works fine:

   user  system elapsed 
   0.45    0.00    0.46 

#Use the classifier we built to make predictions on the test set.
system.time(pred <- predict(classifier, newdata=test.cc))

However, the prediction fails:

Error in `[.default`(object$tables[[v]], , nd) : subscript out of bounds
Timing stopped at: 0.2 0 0.2

1 Answer:

Answer 0 (score: 0)

Consider the following:

# Row indices of the training observations.
train <- sample(nrow(Df.dtm), ceiling(nrow(Df.dtm) * .50))

# Whatever is left in Df.dtm after removing the sampled rows.
# Note that this returns the observations themselves, not their indices:
test <- Df.dtm[-train,]
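
As a rough illustration of that row-wise split, here is a minimal sketch on a small made-up data frame. The names `toy`, `train_idx`, `train_set` and `test_set` are hypothetical stand-ins and are not taken from the question's data.

```r
set.seed(42)  # only so that the sketch is reproducible

# Hypothetical stand-in for Df.dtm: 10 documents, 3 term columns and a label.
toy <- data.frame(
  term_a   = c(0, 2, 0, 1, 0, 3, 0, 0, 1, 0),
  term_b   = c(1, 0, 0, 0, 2, 0, 1, 0, 0, 0),
  term_c   = c(0, 0, 1, 0, 0, 0, 0, 4, 0, 1),
  category = factor(rep(c("dip", "pols"), 5))
)

# Row indices of the training observations (half of the rows).
train_idx <- sample(nrow(toy), ceiling(nrow(toy) * 0.5))

# Positive indices select the sampled rows; negative indices drop them,
# so the two sets are disjoint and share the same columns.
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

dim(train_set)
dim(test_set)
```

Keeping the indices in one variable and the sliced data in another makes it harder to mix up "row numbers" and "rows" later on.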

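Once it is clear what the sample returns (row indices) and how the test set should be sliced (again, you need to settle on rows or columns at this point), I would adjust the arguments passed to the apply function. The documentation for apply explains this in detail, but in short: if you pass 2, the function is applied to each column; if you pass 1, it is applied to each row. Depending on whether you are sampling rows or columns, the call can be adjusted accordingly.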
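
The difference between the two margins can be sketched on a tiny made-up count matrix (`counts` is hypothetical; `convert_count` mirrors the function in the question). The sketch also shows a side effect that may be related to the error, although I have not checked this against the asker's data: apply() returns a character matrix, so the "No"/"Yes" factor levels are dropped, and a column that happens to contain only one value in the training rows can end up with fewer levels than the same column in the test rows. Building a data frame with lapply() keeps both levels in every column.

```r
# Hypothetical word-count matrix: 4 documents (rows) by 2 terms (columns).
counts <- matrix(c(0, 2, 0, 1,
                   0, 0, 0, 0),
                 ncol = 2,
                 dimnames = list(NULL, c("term_a", "term_b")))

row_sums <- apply(counts, 1, sum)  # MARGIN = 1: one result per row (document)
col_sums <- apply(counts, 2, sum)  # MARGIN = 2: one result per column (term)

# Same conversion as in the question: counts become "No"/"Yes" factors.
convert_count <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}

# apply() assembles its results into a matrix, and a matrix cannot hold
# factors, so only the strings survive and the level information is lost.
via_apply <- apply(counts, 2, convert_count)
class(via_apply[, "term_b"])

# lapply() over a data frame keeps every column as a factor with both levels,
# even for term_b, which never occurs in these documents.
via_lapply <- as.data.frame(lapply(as.data.frame(counts), convert_count))
levels(via_lapply$term_b)
```

If the level mismatch is indeed the problem, converting both the training and the test rows this way before calling naiveBayes and predict would be one thing to try.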