I am a PhD student learning to apply machine learning algorithms for my information assurance dissertation.
I have been using XGBoost for text classification, specifically sentiment analysis, on the Pang and Lee movie review dataset: 2000 movie reviews, split between positive and negative. With XGBoost I managed to get 98.33% accuracy. I was then looking at combining PCA with my bag-of-words feature set to reduce the dimensionality. I used the prcomp function in R on the training set and it worked well; accuracy on the training set was 99.8%.
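For reference, the training-side step was essentially the following (a simplified sketch; dtm_train is the document-term matrix built in the full code below, and the number of components kept here is illustrative rather than my exact setting):

prin_comp_train <- prcomp(dtm_train, scale. = TRUE)   # runs without error on the training DTM
train_pcs <- prin_comp_train$x[, 1:100]               # keep the first 100 components (illustrative)
# train_pcs is then used as the feature matrix for xgboost()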
When I try to use prcomp on the test set, however, I get the following error:

cannot rescale a constant/zero column to unit variance

This is the line of code that fails:

prin_comp <- prcomp(dtm_test, scale. = TRUE)

Because of this problem I cannot run the test set. I have included the complete R code for the project below. Thanks for any help.
setwd('C:/rscripts/movies')
# Pang and Lee movie reviews: one row per review, with a class label and the review text
imdb = read.csv('movies.csv', stringsAsFactors = FALSE)
library(text2vec)
library(caret)
library(magrittr)
library(xgboost)
library(glmnet)
library(stringr)
colnames(imdb)<-c("class","text")
imdb$text<-as.character(imdb$text)
set.seed(100)
# 70/30 train/test split, stratified on the class label
inTrain1<-createDataPartition(imdb$class,p=0.70,list=F)
train<-imdb[inTrain1,]
test<-imdb[-inTrain1,]
train<-cbind(train,id=rownames(train))
test<-cbind(test,id=rownames(test))
rownames(train)<-c(1:nrow(train))
rownames(test)<-c(1:nrow(test))
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(train$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,
                  progressbar = FALSE)
vocab = create_vocabulary(it_train)
iconv(vocab, "latin1", "ASCII", sub="")   # meant to strip non-ASCII characters; the return value is not used
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
stop_words = c("the", "is", "and", "have","off","why")
vocab = create_vocabulary(it_train, stopwords = stop_words)
# attempted clean-up of the vocabulary terms (punctuation, whitespace, digits,
# @handles, URLs, a few extra words); note the gsub results are not assigned back
gsub('[[:punct:] ]+',' ',vocab)
gsub("^\\s+|\\s+$", "", vocab)
gsub('[0-9]+', '', vocab)
gsub("@\\w+ *", "", vocab)
gsub("http", "", vocab)
gsub("1", "", vocab)
gsub("2", "", vocab)
gsub("my", "", vocab)
gsub("too", "", vocab)
gsub("for", "", vocab)
# prune rare and overly common terms, then rebuild the vectorizer and the training DTM
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
# tokenize the test reviews and build the test DTM with the same vectorizer
it_test = test$text %>%
  prep_fun %>%
  tok_fun %>%
  itoken(ids = test$id,
         progressbar = FALSE)
dtm_test = create_dtm(it_test, vectorizer)
prin_comp <- prcomp(dtm_test, scale. = TRUE) # GIVES ERROR
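I suspect the message means that some columns of dtm_test are constant (all zeros), i.e. terms from the training vocabulary that never occur in the test documents, so they cannot be rescaled to unit variance. A quick check along these lines should show whether that is the case (a sketch only; as.matrix() densifies the sparse DTM, which is fine at this size):

col_var <- apply(as.matrix(dtm_test), 2, var)   # per-term variance across the test documents
sum(col_var == 0)                               # number of constant/zero columns
head(colnames(dtm_test)[col_var == 0])          # a few of the offending terms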