Question

我正在尝试使用stm中的R包来计算不同时期语料库中主题的相对患病率或比例。例如，假设在时间段1，该时间段文档中的主题组成可能是：主题A的80％，主题B的10％和主题C的10％。在下一个时间段中，再次包含一堆文件，组成可能是70％A，30％B和0％C。

结构主题模型听起来很适合此主题，因为您可以定义协变量，例如时间。除了...我似乎无法弄清楚该怎么做。这是一个使用quanteda包中的一些数据的最小示例：

library(quanteda)
library(stm)
library(ggplot2)
library(dplyr)
library(tidyr)

# get data: US presidents inaugural speeches
df = dfm(data_corpus_inaugural, tolower = T, stem=T,remove=stopwords(), remove_punct=T) %>% 
 dfm_trim(min_termfreq = 10, max_docfreq = 0.75, docfreq_type="prop") %>% 
 convert("stm")
df$meta$Decade = as.numeric(gsub("^(...).*","\\1", df$meta$Year)) 
 # grouped the speeches by decade, instead of year
smod <- stm(df$documents, df$vocab, K = 5, verbose = FALSE, prevalence = ~Decade, data=df$meta )
summary(smod)

# attempt 1: use doc-topic proportions - but this is per document, not per decade...
labs = labelTopics(smod)
rownames(smod$theta) = 1:58
colnames(smod$theta) = 1:5
d=as.data.frame.table(smod$theta)
ggplot(d, aes(x = Var1, y = Freq, group = Var2, colour = Var2)) + 
  geom_point() +
  geom_line()

# attempt 1.2: use doc-topic proportions, but take mean per time period (decade)
# this sort of gives an idea how much on average each topic was present among the documents
d2 = cbind(smod$theta, df$meta$Decade); colnames(d2)[6] = "decade"
d2 %>% as.data.frame() %>% gather(topic, proportion, 1:5, factor_key = T) %>% 
 group_by(decade, topic) %>% 
 summarise(mean=mean(proportion))
# problem, won't sum to one, also not sure if correct approach

# attempt 2: try using the prevalence estimation
est = estimateEffect(1:5 ~ s(Decade), smod, df$meta) # based on stm help
plot(est, "Decade", model=smod, method="continuous")
abline(h=0)

最后一位生成一个对象，绘制该对象后，它的确看起来像是随时间流逝的流行。不过，有一些观察结果：平滑了（由于公式中的s()），而我正在寻找确切的逐期主题比例；还有一些值低于0（不仅是置信区间，还有一些估计曲线）。仅在回归公式中使用“十年”即可获得线性回归，这对于此任务毫无用处。 est$parameters对象包含每个样条线的截距和值（25次模拟）；将plot()保存为对象，并使用method="point"可以访问每个主题的计算出的means比例值（但仍来自平滑样条-增加df的值s()可提供更多值）。无论如何，通过plot命令进行的平滑回归似乎是获取比例的一种about回的方式。

这样的问题：如何~~让人联想到~~正确地从stm模型中获得每个周期的相对主题构成（比例）？

（或者：如果那是完全错误的方法来获得这样的主题比例，那还有什么更好的方法？）

（更新：this QA建议采用LDA中后继主题的方法；我不确定这是否与我的尝试1相同，或者这是否是一种好方法？）

如何在多个时间段内从STM中获取准确的相对主题患病率？

0 个答案: