Question

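I am using the awesome quanteda package to convert my dfm to the topicmodels format. However, in the process I lose the docvars that I need to identify which topics are most prevalent in my documents. This is especially a problem because the topicmodels package (like STM) keeps only documents with non-zero counts, so the number of documents in the original dfm differs from the number in the model output. Is there a way for me to correctly identify the documents in this case?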

Answer 1

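I checked your result. Because of your select statement, there are no features left in dfm_speeches. Converting it to the "dtm" format used by topicmodels does indeed give you a document-term matrix with no documents and no terms.

But if your dfm_select selection yields a dfm that still has features, and you then convert it to the dtm format, you will see the docvars appear.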

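library(quanteda)

# Build the dfm, then trim rare and very widespread features
dfm_speeches <- dfm(data_corpus_irishbudget2010, 
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>% 
  dfm_trim(min_termfreq = 4, max_docfreq = 10)

# Select features that actually occur in the corpus
dfm_speeches <- dfm_select(dfm_speeches, c("Bruton", "Cowen"))

# The docvars are still attached to the dfm
docvars(dfm_speeches)

dfmlda <- convert(dfm_speeches, to = "topicmodels")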

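The result can then be used further with topicmodels. I admit that if you convert to tm's dtm format and you have no features, the documents do show up in the dtm. I am not sure whether converting to topicmodels has an unintended side effect when there are no features.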
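One way to keep the dfm and the model input aligned is to drop the empty documents explicitly before converting, so that you know exactly which documents remain. The sketch below uses a hypothetical three-document toy corpus rather than the Irish budget data; the texts, the `speaker` docvar, and the object names are assumptions made for illustration only.

```r
library(quanteda)

# Hypothetical toy corpus; "d2" contains none of the selected features.
toy_corpus <- corpus(
  c(d1 = "budget tax budget", d2 = "weather rain", d3 = "tax cuts"),
  docvars = data.frame(speaker = c("A", "B", "C"))
)

toy_dfm <- dfm(tokens(toy_corpus))
toy_dfm <- dfm_select(toy_dfm, pattern = c("budget", "tax"))

# Keep only documents that still have at least one token.
toy_nonempty <- dfm_subset(toy_dfm, ntoken(toy_dfm) > 0)

# The surviving documents keep their names and docvars.
docnames(toy_nonempty)
docvars(toy_nonempty)
```

Converting `toy_nonempty` instead of the full dfm means the document names in the model output can be looked up directly in an object whose docvars you still hold.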

Answer 2

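I don't think the problem is described clearly, but I believe I understand what it is.

The document-feature matrix passed to a topic model cannot contain empty documents, so the model returns a named vector of topics that omits those documents. You can still use it if you match its names against the document names: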

# mx is a quanteda dfm
# topic is a named vector for topics from LDA

docvars(mx, "topic") <- topic[match(docnames(mx), names(topic))]
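To see what the match() call does, here is a minimal base R sketch with made-up values. `all_docs` stands in for `docnames(mx)` and `topic` stands in for the named vector a fitted model returns (for instance from topicmodels' `topics()`); both are hypothetical stand-ins, not output from the data above.

```r
# Stand-in for docnames(mx): every document in the original dfm.
all_docs <- c("d1", "d2", "d3", "d4")

# Stand-in for the model output: one topic per surviving document,
# named by document. "d2" was empty, so it is missing here.
topic <- c(d1 = 2L, d3 = 1L, d4 = 2L)

# match() gives NA for documents the model never saw,
# so the result has one entry per original document.
aligned <- topic[match(all_docs, names(topic))]
names(aligned) <- all_docs
aligned
```

Documents that were dropped before modelling end up with NA as their topic, which keeps the vector the same length as the dfm and therefore safe to assign as a docvar.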

Answer 3

Sorry, here is an example.

library(quanteda)

dfm_speeches <- dfm(data_corpus_irishbudget2010, 
                    remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english")) %>% 
  dfm_trim(min_termfreq = 4, max_docfreq = 10)
# Neither word matches a feature here, so nothing is left after the selection
dfm_speeches <- dfm_select(dfm_speeches, c("corbyn", "hillary"))

library(topicmodels)
dfmlda <- convert(dfm_speeches, to = "topicmodels")
dfmlda

As you can see, the dfmlda object is empty, because I modified my dfm by removing specific words.

quanteda: convert to topicmodels while keeping docvars
