Question

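I tried the three default options for alpha in gensim's LDA implementation and am now puzzled by the result: the sum of topic probabilities over all documents is smaller than the number of documents in the corpus (see below). For example, alpha = 'symmetric' yields about 9357 as the sum of topic probabilities, whereas the number of documents is 9459. Can anyone tell me the reason for this unexpected result?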

alpha = symmetric
nr_of_docs = 9459
sum_of_topic_probs = 9357.12285605

alpha = asymmetric
nr_of_docs = 9459
sum_of_topic_probs = 9375.29253851

alpha = auto
nr_of_docs = 9459
sum_of_topic_probs = 9396.40123459

Answer 1

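I tried to replicate your problem, but in my case (using a very small corpus) I could not find any difference between the three sums.
In case anyone else wants to replicate the problem, I will still share the path I tried ;-)

I used a small example from the gensim website to train three different LDA models: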

import pandas as pd  # used further down to inspect topic distributions
from gensim import corpora, models
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

lda_sym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                   chunksize=100000, passes=1, alpha='symmetric')
lda_asym = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                    chunksize=100000, passes=1, alpha='asymmetric')
lda_auto = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, update_every=1,
                                    chunksize=100000, passes=1, alpha='auto')

Now I sum the topic probabilities over all documents (9 documents in total):

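counts = {}
for model in [lda_sym, lda_asym, lda_auto]:
    total = 0.0
    for doc_n, bow in enumerate(corpus):
        # model[bow] returns a list of (topic_id, probability) pairs
        doc_sum = sum(prob for _, prob in model[bow])
        # flag any single document whose probabilities do not add up to 1
        if doc_sum < 1 - 1e-6:
            print('Sum smaller than 1 for')
            print(model, doc_n)
        total += doc_sum
    counts[model] = total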

And indeed the sum is always 9:

counts = {<gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3908>: 9.0,
          <gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3048>: 9.0,
          <gensim.models.ldamodel.LdaModel at 0x7ff3cd1f3b70>: 9.0}

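Of course, this is not a representative example, since it is so small. If you can, please provide some more details about your corpus.

In general, I think the sum should always equal the number of documents. My first intuition was that empty documents might change the sum, but that is not the case either, because an empty document simply yields a topic distribution identical to alpha (which makes sense):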

pd.DataFrame(lda_asym[[]])[1]

returns

0    0.203498
1    0.154607
2    0.124657
3    0.104428
4    0.089848
5    0.078840
6    0.070235
7    0.063324
8    0.057651
9    0.052911

which is identical to

lda_asym.alpha

array([ 0.20349777,  0.1546068 ,  0.12465746,  0.10442834,  0.08984802,
    0.0788403 ,  0.07023542,  0.06332404,  0.057651  ,  0.05291085])

which also sums to 1.
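As a side check, the printed values are consistent with an unnormalised prior of 1 / (k + sqrt(K)) for topic index k and K topics, rescaled to sum to 1. The sketch below reproduces them in plain Python. That this is the formula gensim uses for alpha='asymmetric' is an assumption inferred from the numbers above, and the function name is hypothetical.

```python
import math

def asymmetric_alpha(num_topics):
    """Hypothetical helper: prior proportional to 1 / (k + sqrt(K)), normalised to sum to 1.

    Assumption: this mirrors what alpha='asymmetric' produces; it is inferred
    from the values printed above, not taken from the gensim source.
    """
    raw = [1.0 / (k + math.sqrt(num_topics)) for k in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]

alpha = asymmetric_alpha(10)
for k, value in enumerate(alpha):
    print(k, round(value, 6))
print("sum:", round(sum(alpha), 6))
```

For 10 topics the first and last entries come out at roughly 0.203498 and 0.052911, matching lda_asym.alpha above.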

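From a theoretical perspective, choosing different alphas yields different LDA models.

Alpha is the hyperparameter of the Dirichlet prior. The Dirichlet prior is the distribution from which we draw theta, and theta is the parameter that determines the shape of a document's topic distribution. In essence, alpha influences how we draw topic distributions. That is why choosing different alphas will also give you slightly different results for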

lda.show_topics()
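To make the role of alpha concrete, here is a minimal sketch that draws theta from a symmetric Dirichlet prior for two alpha values. It uses only the standard library and samples via normalised Gamma draws; the helper names and the alpha values 0.1 and 10.0 are illustrative assumptions, not anything gensim does internally. Every draw sums to 1 regardless of alpha; what changes is how concentrated the mass is.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one theta ~ Dirichlet(alpha) by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_component(alpha_value, num_topics=10, n_samples=2000, seed=0):
    """Average size of the dominant topic under a symmetric prior (illustrative helper)."""
    rng = random.Random(seed)
    alpha = [alpha_value] * num_topics
    samples = [sample_dirichlet(alpha, rng) for _ in range(n_samples)]
    # every sampled theta is a proper probability distribution
    assert all(abs(sum(theta) - 1.0) < 1e-9 for theta in samples)
    return sum(max(theta) for theta in samples) / n_samples

sparse = mean_largest_component(0.1)   # small alpha: mass piles onto few topics
flat = mean_largest_component(10.0)    # large alpha: mass spread almost evenly
print("alpha=0.1  mean largest component:", round(sparse, 3))
print("alpha=10.0 mean largest component:", round(flat, 3))
```

A small alpha favours documents dominated by one or two topics, while a large alpha favours near-uniform mixtures. Neither choice changes the fact that each theta sums to 1.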

But I do not see why the sum of a document's topic probabilities should differ from 1 for any LDA model or any kind of document.

Answer 2

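I think the problem is the default setting: minimum_probability is set to 0.01 instead of 0.00.

You can check out the LDA model code here:

So if you train the model with the default settings, the probabilities over the topics for a given document may not add up to 1.00, because topics below the threshold are left out of the returned list.

Since minimum_probability is passed in here, you can always reset it with: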

your_lda_model_name.minimum_probability = 0.0
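The effect of such a threshold can be sketched without gensim. The snippet below is a minimal illustration under stated assumptions: filter_topics is a hypothetical stand-in for the filtering described above, and theta is a made-up topic distribution for one document, not output from a real model.

```python
def filter_topics(theta, minimum_probability=0.01):
    """Hypothetical stand-in: keep only (topic_id, prob) pairs at or above the threshold."""
    return [(k, p) for k, p in enumerate(theta) if p >= minimum_probability]

# Made-up distribution over 10 topics for a single document; it sums to 1.
theta = [0.905, 0.06, 0.005, 0.005, 0.005, 0.004, 0.004, 0.004, 0.004, 0.004]

kept_default = filter_topics(theta)                           # threshold 0.01
kept_all = filter_topics(theta, minimum_probability=0.0)      # threshold 0.0

sum_default = sum(p for _, p in kept_default)
sum_all = sum(p for _, p in kept_all)
print("topics kept with 0.01:", len(kept_default), "sum:", round(sum_default, 3))
print("topics kept with 0.0: ", len(kept_all), "sum:", round(sum_all, 3))
```

With the 0.01 threshold only two topics survive and 3.5% of the mass is missing from the sum. In the question, 9357.12 / 9459 is about 0.989, so roughly 1% of the probability mass per document is unaccounted for, which is the order of magnitude such filtering would produce. gensim's get_document_topics also accepts a minimum_probability argument if you prefer to set the threshold per call.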
