I am using the text2vec package for an LDA model implemented in R, but I would like to know how to assign each document to a topic.
Below is my code:
library(stringr)
library(rword2vec)
library(wordVectors)
#install.packages("text2vec")
library(text2vec)
library(data.table)
library(magrittr)
prep_fun = function(x) {
  x %>%
    # make text lower case
    str_to_lower %>%
    # remove non-alphabetic symbols
    str_replace_all("[^[:alpha:]]", " ") %>%
    # collapse multiple spaces
    str_replace_all("\\s+", " ")
}
movie_review_train = prep_fun(movie_review_train)
# text is already lower-cased by prep_fun, so tokenize directly
tokens = movie_review_train[1:1000] %>%
  word_tokenizer
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v
vectorizer = vocab_vectorizer(v)
t1 = Sys.time()
dtm_train = create_dtm(it, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
dim(dtm_train)
stop_words = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves")
t1 = Sys.time()
v = create_vocabulary(it, stopwords = stop_words)
print(difftime(Sys.time(), t1, units = 'sec'))
pruned_vocab = prune_vocabulary(v,
term_count_min = 10,
doc_proportion_max = 0.5,
doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
# create dtm_train with new pruned vocabulary vectorizer
t1 = Sys.time()
dtm_train = create_dtm(it, vectorizer)
print(difftime(Sys.time(), t1, units = 'sec'))
dtm_train_l1_norm = normalize(dtm_train, "l1")
tfidf = TfIdf$new()
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# transform() is meant for *new* data; re-applying it to dtm_train_tfidf
# would weight the training data twice
dtm = dtm_train_tfidf
lda_model <- LDA$new(n_topics = ntopics,
                     doc_topic_prior = alphaprior,
                     topic_word_prior = deltaprior)
lda_model$get_top_words(n = 10, topic_number = c(1:5), lambda = 0.3)
After this, I want to assign each document to its relevant topic. I got the list of top terms per topic, but I don't know how to do the mapping.
Answer 0 (score: 0)
The document-topic distribution doc_topic_distr projects each document into topic space; it can be computed with the code below, following Dmitriy Selivanov's documentation (see http://text2vec.org/topic_modeling.html#example6).
In fact, the two key outputs of a topic model are the topic-word matrix and the document-topic matrix. The topic-word matrix (topic-word distribution) gives the weight of each word within each topic, while the document-topic matrix (document-topic distribution) gives the contribution of each topic to each document.
doc_topic_distr =
  lda_model$fit_transform(x = dtm, n_iter = 1000,
                          convergence_tol = 0.001, n_check_convergence = 25,
                          progressbar = FALSE)
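With doc_topic_distr in hand, assigning each document to a single topic is just a row-wise argmax: each row is that document's distribution over topics, so the column with the highest probability is the assigned topic. A minimal sketch (the toy matrix here only stands in for the real doc_topic_distr returned by fit_transform):

```r
# toy stand-in for doc_topic_distr: 3 documents x 4 topics,
# each row sums to 1 (a distribution over topics)
doc_topic_distr <- matrix(c(0.1, 0.6, 0.2, 0.1,
                            0.7, 0.1, 0.1, 0.1,
                            0.2, 0.2, 0.5, 0.1),
                          nrow = 3, byrow = TRUE)

# hard assignment: index of the highest-probability topic for each row
doc_topic <- apply(doc_topic_distr, 1, which.max)
doc_topic  # one topic index per document
```

If you need a soft assignment instead, keep the full row of probabilities rather than collapsing it with which.max.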