I am trying to use the LDA function from the topicmodels package to assign a list of survey questions to 30 different categories.
My code so far is:
library(tm)           # corpus handling and text cleaning
library(topicmodels)  # LDA
source <- VectorSource(openended$q2)
corpus <- Corpus(source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = "english")
mat <- DocumentTermMatrix(corpus)
rowTotals <- apply(mat , 1, sum)
mat <- mat[rowTotals> 0, ]
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
k <- 30
ldaOut <- LDA(mat, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics,file=paste("LDAGibbs",k,"DocsToTopics.csv"))
I already have 10% of the data in openended$q2 properly coded. How can I use that data to train the algorithm?
Thanks!