Probabilities returned by gensim's get_document_topics method don't add up

Asked: 2017-06-15 15:36:39

Tags: text-mining gensim lda topic-modeling

Sometimes it returns probabilities for all topics and all is well, but sometimes it returns probabilities for only a few topics, and they don't add up to one; it seems to depend on the document. Generally, when it returns only a few topics, the probabilities sum to more or less 80%. Does it only return the most relevant topics? Is there a way to force it to return all probabilities?

Maybe I'm missing something, but I can't find any documentation for the method's parameters.

2 Answers:

Answer 0 (score: 1):

I ran into the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of the gensim.models.ldamodel.LdaModel object:

topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)

By default, gensim does not output probabilities below 0.01, so in particular, for any document where some topics are assigned probabilities under this threshold, the topic probabilities for that document will not sum to one.

Here's an example:

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)

# Try values of minimum_probability argument of None (default) and 0
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [ [entry[1] for entry in doc] for doc in topic_assignments ]
    # Print output
    print(f"Calculating topic probabilities with minimum_probability argument = {str(minimum_probability)}")
    print(f"Sum of probabilites:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")


# OUTPUT
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
    doc 0 = 0.6733324527740479
    doc 1 = 0.8585712909698486
    doc 2 = 0.7549994885921478
    doc 3 = 0.8019999265670776
    doc 4 = 0.7524996995925903
    doc 5 = 0
    doc 6 = 0
    doc 7 = 0
    doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
    doc 0 = 1.0000000400468707
    doc 1 = 1.0000000337604433
    doc 2 = 1.0000000079162419
    doc 3 = 1.0000000284053385
    doc 4 = 0.9999999937135726
    doc 5 = 0.9999999776482582
    doc 6 = 0.9999999776482582
    doc 7 = 0.9999999776482582
    doc 8 = 0.9999999930150807

This default behavior is not stated very clearly in the documentation. The default value of the get_document_topics method's minimum_probability argument is None, but this does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability attribute of the gensim.models.ldamodel.LdaModel object, which is 0.01 by default, as you can see in the source code:

def __init__(self, corpus=None, num_topics=100, id2word=None,
             distributed=False, chunksize=2000, passes=1, update_every=1,
             alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
             iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
             random_state=None, ns_conf=None, minimum_phi_value=0.01,
             per_word_topics=False, callbacks=None, dtype=np.float32):
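
Another option, therefore, is to set minimum_probability on the model itself at construction time, so that get_document_topics returns the full distribution without a per-call override. Here is a minimal sketch reusing the example corpus above (note that gensim clips the threshold to a tiny epsilon internally rather than using exactly zero, which is why the sums are approximately, not exactly, one):

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Set the threshold on the model; get_document_topics with
# minimum_probability=None (the default) falls back to this value
lda = LdaModel(common_corpus, num_topics=100, minimum_probability=0)

for doc in lda.get_document_topics(common_corpus):
    print(sum(prob for _, prob in doc))  # ~1.0 for every document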

Answer 1 (score: 0):

I was working on LDA topic modeling and came across this post. I created two topics, say topic1 and topic2.

The top 10 words for each topic are as follows:

0.009*"would" + 0.008*"experi" + 0.008*"need" + 0.007*"like" + 0.007*"code" + 0.007*"work" + 0.006*"think" + 0.006*"make" + 0.006*"one" + 0.006*"get"

0.027*"ierr" + 0.018*"line" + 0.014*"0.0e+00" + 0.010*"error" + 0.009*"defin" + 0.009*"norm" + 0.006*"call" + 0.005*"type" + 0.005*"de" + 0.005*"warn"

Finally, I took one document to determine the closest topic:

for d in doc:
    # Convert the raw document string to a bag-of-words vector
    bow = dictionary.doc2bow(d.split())
    # Infer the topic distribution for this document
    t = lda.get_document_topics(bow)

and the output is [(0, 0.88935698141006414), (1, 0.1106430185899358)].
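
To pick the closest topic from that output, take the pair with the highest probability (a minimal sketch; t is the list returned by the loop above):

# t == [(0, 0.88935698141006414), (1, 0.1106430185899358)]
best_topic, best_prob = max(t, key=lambda pair: pair[1])
print(best_topic, best_prob)  # 0 0.88935698141006414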

To answer your first question: for a document, the probabilities do add up to 1.0, and that is what get_document_topics does. The documentation explicitly states that it returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.

Additionally, I tried get_term_topics with the keyword "ierr":

t = lda.get_term_topics("ierr", minimum_probability=0.000001), and the result is [(1, 0.027292299843400435)], which simply gives the word's contribution to each topic, and that makes sense.

So you can label documents based on the topic distribution obtained with get_document_topics, and you can determine the importance of a word based on the contribution given by get_term_topics.
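
Putting the two together, a minimal sketch of that workflow might look like this (dictionary, lda, and doc are assumed to be the objects from the snippets above):

for d in doc:
    bow = dictionary.doc2bow(d.split())

    # Label the document with its most probable topic
    topic_dist = lda.get_document_topics(bow)
    label, _ = max(topic_dist, key=lambda pair: pair[1])

    # Rank the document's words by their contribution to that topic
    word_scores = []
    for word_id, _count in bow:
        word = dictionary[word_id]
        for topic_id, score in lda.get_term_topics(word, minimum_probability=0.000001):
            if topic_id == label:
                word_scores.append((word, score))
    word_scores.sort(key=lambda pair: pair[1], reverse=True)

    print(label, word_scores[:5])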

I hope this helps.