I am trying out topic modeling using R for the first time. So, this might be a very dumb question but I am stuck and googling has not given a definitive answer.
Given a corpus of documents, I used the LDA function to identify the topics in the corpus. Once the model has been fitted, how can I apply it to a new batch of documents to classify them among the topics discovered so far?
Example code:
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
n <- nrow(AssociatedPress)
# row indices of a random 75% training sample
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[train_data, ]
# negative indices drop the training rows; !(train_data) would select no rows
AssociatedPress_test <- AssociatedPress[-train_data, ]
ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))
Now, can I classify the documents in AssociatedPress_test using the fitted model ap_lda? If yes, how? If not, what would be the best way to create a model for such future classification?
Answer 0 (score: 2)
You can use the topicmodels::posterior() function to find the "top topic" for each new document in the AssociatedPress_test object. The snippet below shows how to do this.
# code provided in question------------------------------------------
library(tm)
data("AssociatedPress", package = "topicmodels")
n <- nrow(AssociatedPress)
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[ train_data, ]
AssociatedPress_test <- AssociatedPress[-train_data, ]
ap_lda <- topicmodels::LDA(AssociatedPress_train, k = 5,
control = list(seed = 1234))
#--------------------------------------------------------------------
#posterior probabilities of topics for each document & terms
post_probs <- topicmodels::posterior(ap_lda, AssociatedPress_test)
#classify documents by finding topic with max prob per doc
top_topic_per_doc <- apply(post_probs$topics, 1, which.max)
head(top_topic_per_doc)
#OUTPUT (varies with the random train/test split, since sample() is not seeded)
# [1] 4 2 4 2 2 2
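To make the which.max step concrete, here is a minimal base-R sketch on a toy matrix. The matrix topic_probs and its numbers are made up for illustration; it only stands in for post_probs$topics, which has one row per document and one column per topic. Keeping the maximum probability next to the chosen topic is one possible way to spot documents the model is unsure about.

```r
# Toy stand-in for post_probs$topics: 3 documents x 2 topics (made-up numbers).
topic_probs <- matrix(c(0.9, 0.1,
                        0.2, 0.8,
                        0.6, 0.4),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2")))

# Index of the most probable topic per document (same step as in the answer).
top_topic <- apply(topic_probs, 1, which.max)

# Probability of that topic; low values suggest a weak assignment.
top_prob <- apply(topic_probs, 1, max)

result <- data.frame(document = rownames(topic_probs),
                     topic = top_topic,
                     prob = top_prob)
print(result)
```

Here doc3 is assigned to topic 1 with probability 0.6 only, so it is a weaker match than doc1.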