I am a PhD student learning to apply machine learning algorithms for my information assurance dissertation.
I have been using XGBoost for text classification, specifically sentiment analysis, on the Pang and Lee movie review dataset: 2000 movie reviews, split between positive and negative. With XGBoost I managed to get 98.33% accuracy. I was then looking at combining PCA with my bag-of-words feature set to reduce the dimensionality. I used the prcomp function in R on the training set and it worked well; accuracy on the training set was 99.8%.
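For reference, the training-side step was essentially the following (a simplified sketch; dtm_train is the document-term matrix built in the full code below, and the number of components kept here is illustrative rather than my exact setting):

prin_comp_train <- prcomp(dtm_train, scale. = TRUE)   # runs without error on the training DTM
train_pcs <- prin_comp_train$x[, 1:100]               # keep the first 100 components (illustrative)
# train_pcs is then used as the feature matrix for xgboost()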
When I try to use prcomp on the test set, however, I get the following error:

cannot rescale a constant/zero column to unit variance

This is the line of code that fails:

prin_comp <- prcomp(dtm_test, scale. = TRUE)

Because of this problem I cannot run the test set. I have included the complete R code for the project below. Thanks for any help.
setwd('C:/rscripts/movies')
# Pang and Lee movie reviews: one row per review, with a class label and the review text
imdb = read.csv('movies.csv', stringsAsFactors = FALSE)
library(text2vec)
library(caret)
library(magrittr)
library(xgboost)
library(glmnet)
library(stringr)
colnames(imdb)<-c("class","text")
imdb$text<-as.character(imdb$text)
set.seed(100)
# 70/30 train/test split, stratified on the class label
inTrain1<-createDataPartition(imdb$class,p=0.70,list=F)
train<-imdb[inTrain1,]
test<-imdb[-inTrain1,]
train<-cbind(train,id=rownames(train))
test<-cbind(test,id=rownames(test))
rownames(train)<-c(1:nrow(train))
rownames(test)<-c(1:nrow(test))
prep_fun = tolower
tok_fun = word_tokenizer
it_train = itoken(train$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = train$id,
                  progressbar = FALSE)
vocab = create_vocabulary(it_train)
iconv(vocab, "latin1", "ASCII", sub="")   # meant to strip non-ASCII characters; the return value is not used
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)
stop_words = c("the", "is", "and", "have","off","why")
vocab = create_vocabulary(it_train, stopwords = stop_words)
# attempted clean-up of the vocabulary terms (punctuation, whitespace, digits,
# @handles, URLs, a few extra words); note the gsub results are not assigned back
gsub('[[:punct:] ]+',' ',vocab)
gsub("^\\s+|\\s+$", "", vocab)
gsub('[0-9]+', '', vocab)
gsub("@\\w+ *", "", vocab)
gsub("http", "", vocab)
gsub("1", "", vocab)
gsub("2", "", vocab)
gsub("my", "", vocab)
gsub("too", "", vocab)
gsub("for", "", vocab)
# prune rare and overly common terms, then rebuild the vectorizer and the training DTM
pruned_vocab = prune_vocabulary(vocab,
                                term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
dtm_train = create_dtm(it_train, vectorizer)
# tokenize the test reviews and build the test DTM with the same vectorizer
it_test = test$text %>%
  prep_fun %>%
  tok_fun %>%
  itoken(ids = test$id,
         progressbar = FALSE)
dtm_test = create_dtm(it_test, vectorizer)
prin_comp <- prcomp(dtm_test, scale. = TRUE) # GIVES ERROR
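I suspect the message means that some columns of dtm_test are constant (all zeros), i.e. terms from the training vocabulary that never occur in the test documents, so they cannot be rescaled to unit variance. A quick check along these lines should show whether that is the case (a sketch only; as.matrix() densifies the sparse DTM, which is fine at this size):

col_var <- apply(as.matrix(dtm_test), 2, var)   # per-term variance across the test documents
sum(col_var == 0)                               # number of constant/zero columns
head(colnames(dtm_test)[col_var == 0])          # a few of the offending terms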