I am trying to do bigram text classification with RTextTools. I get an error (shown below) when creating the DocumentTermMatrix of bigrams. How can I fix it?
Sample dataset:
text <- c('sunny','rainy','sunny sunny','sunny rainy','rainy sunny','rainy rainy','sunny sunny sunny','sunny rainy sunny','sunny sunny rainy','rainy sunny sunny','rainy rainy sunny')
issunny <- c('y','n','y','n','n','n','y','y','y','y','n')
data <- data.frame(text,issunny)
Libraries used:
library(RTextTools)
library(tm)
Code:
dtMatrix <- create_matrix(data["text"], ngramLength=2)
container <- create_container(dtMatrix, data$issunny, trainSize=1:nrow(data), virgin=FALSE)
model <- train_model(container, "TREE")
predictionData <- list("sunny sunny sunny rainy rainy", "rainy sunny rainy rainy", "hello", "", "this is another rainy world")
predMatrix <- create_matrix(predictionData, originalMatrix=dtMatrix)
predSize <- length(predictionData)
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE)
result <- classify_model(predictionContainer, model)
Error encountered:
ngramLength=1
dtMatrix <- create_matrix(data["text"], ngramLength=2);
Error in FUN(X[[i]], ...) : non-character argument
I also tried the RWeka library, but got an error:
library(RTextTools)
library(tm)
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtMatrix <- DocumentTermMatrix(Corpus(VectorSource(data["text"])), control=list(tokenize = BigramTokenizer))
Error in `[.simple_triplet_matrix`(matrix, totalSize, ) : subscript out of bounds
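Edit
In the second snippet (the one using the RWeka package), replacing data["text"] with as.character(data[,"text"]) fixed the error I was getting when creating the DocumentTermMatrix. Many thanks to LukeA for the solution.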
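My understanding of why this helps, as a minimal base-R sketch (it uses a shortened copy of the sample data, and the variable names wrapped and docs are only for illustration): data["text"] is still a data.frame, so a source such as VectorSource would presumably see one element (the whole column) instead of one document per row, while as.character(data[,"text"]) yields a plain character vector with one entry per document.

```r
# Shortened copy of the sample data from the question.
text <- c('sunny', 'rainy', 'sunny sunny', 'sunny rainy')
issunny <- c('y', 'n', 'y', 'n')
data <- data.frame(text, issunny)

# Single-bracket indexing with a column name keeps the data.frame wrapper.
wrapped <- data["text"]
print(is.data.frame(wrapped))

# data[, "text"] extracts the column itself; as.character() also covers the
# case where the column was stored as a factor (the default in older R).
docs <- as.character(data[, "text"])
print(is.character(docs))
print(length(docs))  # one entry per row of the data.frame
```

The docs vector is what then gets passed on, e.g. to Corpus(VectorSource(docs)).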
But now I get another error when creating the container for the prediction matrix:
predictionContainer <- create_container(predMatrix, labels=rep(0,predSize), testSize=1:predSize, virgin=FALSE);
Error in validObject(.Object) : invalid class “matrix.csr” object: ra has too few, or too many elements