Classifying new text using LDA in R

Time: 2018-01-15 18:14:16

Tags: r text-mining lda topic-modeling

I am trying out topic modeling in R for the first time, so this may be a basic question, but I am stuck and searching has not turned up a definitive answer.

Given a corpus of documents, I used the LDA function to identify the topics in the corpus. Once the model has been fitted, how can I apply it to a new batch of documents to classify them among the topics discovered so far?

Example code:

library(topicmodels)
data("AssociatedPress", package = "topicmodels")

n <- nrow(AssociatedPress)
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[train_data, ]
# negative indices drop the training rows; !(train_data) would select no rows
AssociatedPress_test <- AssociatedPress[-train_data, ]

ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))

Now, can I classify the documents in AssociatedPress_test using the fitted model ap_lda? If yes, how? If not, what would be the best way to create a model for such future classification?

1 answer:

Answer 0 (score: 2)

You can use the topicmodels::posterior() function to find the "top topic" for each new document in the AssociatedPress_test object. Below is a snippet showing how to do this.

# code provided in question------------------------------------------
library(tm)
data("AssociatedPress", package = "topicmodels")

n <- nrow(AssociatedPress)
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[ train_data, ]
AssociatedPress_test  <- AssociatedPress[-train_data, ]

ap_lda <- topicmodels::LDA(AssociatedPress_train, k = 5, 
                           control = list(seed = 1234))
#--------------------------------------------------------------------

#posterior probabilities of topics for each document & terms
post_probs <- topicmodels::posterior(ap_lda, AssociatedPress_test)

#classify documents by finding topic with max prob per doc
top_topic_per_doc <- apply(post_probs$topics, 1, which.max)

head(top_topic_per_doc)

#OUTPUT (values will vary between runs because sample() is not seeded)
# [1] 4 2 4 2 2 2
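
To see what the which.max step is doing, and how you might also keep the probability behind each assignment, here is a minimal base-R sketch. It uses a small made-up matrix as a stand-in for post_probs$topics (rows are documents, columns are topics). The numbers, the names topics, top_prob, assigned, and the min_prob cutoff are all assumptions for illustration, not part of topicmodels.

```r
# Toy stand-in for post_probs$topics: one row per document, one column per topic.
# The probabilities are invented for illustration; each row sums to 1.
topics <- matrix(
  c(0.10, 0.70, 0.20,
    0.40, 0.35, 0.25,
    0.05, 0.05, 0.90),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2", "3"))
)

# Column index of the most probable topic for each document
top_topic <- apply(topics, 1, which.max)

# Probability of that topic, useful as a rough confidence measure
top_prob <- apply(topics, 1, max)

# Hypothetical cutoff: leave a document unassigned (NA) when no topic
# reaches min_prob
min_prob <- 0.5
assigned <- ifelse(top_prob >= min_prob, top_topic, NA)

data.frame(top_topic, top_prob, assigned)
```

With real output you would replace topics with post_probs$topics. Whether a cutoff makes sense, and what value to use, depends on how many topics you fit and how peaked the posteriors are, so treat the 0.5 here as a placeholder.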