My task is to apply LDA to a dataset of Amazon reviews and extract 50 topics.
I have already extracted the review text into a vector, and now I am trying to apply LDA.
I created the DTM:
matrix <- create_matrix(dat, language="english", removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE)
<<DocumentTermMatrix (documents: 100000, terms: 174632)>>
Non-/sparse entries: 4096244/17459103756
Sparsity : 100%
Maximal term length: 218
Weighting : term frequency (tf)
But when I try to run it, I get the following error:
lda <- LDA(matrix, 30)
Error in LDA(matrix, 30) :
Each row of the input matrix needs to contain at least one non-zero entry
I searched for solutions and tried the slam package:
matrix1 <- rollup(matrix, 2, na.rm=TRUE, FUN = sum)
I still get the same error.
I am new to this. Can someone help me, or suggest references I can study? That would be very helpful.
There are no empty rows in my original data; it contains only a single column holding the reviews.
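(For context: this error is usually caused not by empty rows in the raw data, but by documents that lose all of their terms during preprocessing, e.g. reviews consisting only of stopwords. A commonly suggested fix is to drop those rows before calling `LDA`; a sketch using `slam::row_sums`, with variable names matching the question:)

```r
library(slam)         # row_sums works directly on sparse simple_triplet_matrix objects
library(topicmodels)

# Keep only documents that still contain at least one term
matrix <- matrix[row_sums(matrix) > 0, ]

lda <- LDA(matrix, k = 30)
```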
Answer 0 (score: 1)
I was assigned a similar task and I am also learning as I go. I have made some progress, so I am sharing my code snippet in the hope that it helps.
library("topicmodels")
library("tm")
x<-c("I like to eat broccoli and bananas.",
"I ate a banana and spinach smoothie for breakfast.",
"Chinchillas and kittens are cute.",
"My sister adopted a kitten yesterday.",
"Look at this cute hamster munching on a piece of broccoli.")
#whole file is lowercased
#text<-tolower(x)
#deleting all common words from the text
#text2<-setdiff(text,stopwords("english"))
#splitting the text into vectors where each vector is a word..
#text3<-strsplit(text2," ")
# Generating a structured text i.e. Corpus
docs<-Corpus(VectorSource(x))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
#Removing all the special characters..
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
docs<-tm_map(docs,removeWords,c("\t"," ",""))
# LDA() expects documents in rows, so build a DocumentTermMatrix (not a TermDocumentMatrix)
dtm <- DocumentTermMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
#print(dtm)
freq<-colSums(as.matrix(dtm))
print(names(freq))
ord<-order(freq,decreasing=TRUE)
write.csv(freq[ord],"word_freq.csv")
burnin<-4000
iter<-2000
thin<-500
seed<-list(2003,5,63,100001,765)
nstart<-5
best<-TRUE
#Number of Topics
k<-3
# Docs to topics
ldaOut<-LDA(dtm,k,method="Gibbs",control=list(nstart=nstart,seed=seed,best=best,burnin=burnin,iter=iter,thin=thin))
ldaOut.topics<-as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
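To also inspect what each topic is about, the same `ldaOut` object can be queried for its top terms and per-document topic probabilities; a short sketch following the pattern above (the file name is just illustrative):

```r
# Top 10 terms for each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 10))
write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))

# Per-document topic probabilities (one row per document, one column per topic)
topicProbabilities <- as.data.frame(ldaOut@gamma)
```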