Sometimes it returns probabilities for all topics and everything is fine, but sometimes it returns probabilities for only a few topics, and they don't add up to one; it seems to depend on the document. Usually, when it returns only a few topics, the probabilities sum to roughly 80%. So is it only returning the most relevant topics? Is there a way to force it to return all probabilities?
Maybe I'm missing something, but I can't find any documentation for the method's parameters.
Answer 0 (score: 1)
I ran into the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of the gensim.models.ldamodel.LdaModel object.
topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)
By default, gensim does not output probabilities below 0.01, so for any given document, if any topics are assigned a probability under this threshold, the topic probabilities for that document will not sum to one.
Here is an example:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)

# Try values of the minimum_probability argument of None (default) and 0
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [[entry[1] for entry in doc] for doc in topic_assignments]
    # Print output
    print(f"Calculating topic probabilities with minimum_probability argument = {minimum_probability}")
    print("Sum of probabilities:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")
# OUTPUT
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
doc 0 = 0.6733324527740479
doc 1 = 0.8585712909698486
doc 2 = 0.7549994885921478
doc 3 = 0.8019999265670776
doc 4 = 0.7524996995925903
doc 5 = 0
doc 6 = 0
doc 7 = 0
doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
doc 0 = 1.0000000400468707
doc 1 = 1.0000000337604433
doc 2 = 1.0000000079162419
doc 3 = 1.0000000284053385
doc 4 = 0.9999999937135726
doc 5 = 0.9999999776482582
doc 6 = 0.9999999776482582
doc 7 = 0.9999999776482582
doc 8 = 0.9999999930150807
This default behavior is not stated very clearly in the documentation. The default value of minimum_probability for the get_document_topics method is None, but this does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability attribute of the gensim.models.ldamodel.LdaModel object, which defaults to 0.01, as you can see in the source code:
def __init__(self, corpus=None, num_topics=100, id2word=None,
             distributed=False, chunksize=2000, passes=1, update_every=1,
             alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
             iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
             random_state=None, ns_conf=None, minimum_phi_value=0.01,
             per_word_topics=False, callbacks=None, dtype=np.float32):
Answer 1 (score: 0)
I was working on LDA topic modeling and came across this post. I created two topics, say topic1 and topic2.
The top 10 words for each topic are as follows:
0.009*"would" + 0.008*"experi" + 0.008*"need" + 0.007*"like" + 0.007*"code" + 0.007*"work" + 0.006*"think" + 0.006*"make" + 0.006*"one" + 0.006*"get
0.027*"ierr" + 0.018*"line" + 0.014*"0.0e+00" + 0.010*"error" + 0.009*"defin" + 0.009*"norm" + 0.006*"call" + 0.005*"type" + 0.005*"de" + 0.005*"warn
Finally, I took one document to determine the closest topic.
for d in doc:
    bow = dictionary.doc2bow(d.split())
    t = lda.get_document_topics(bow)
and the output is [(0, 0.88935698141006414), (1, 0.1106430185899358)].
To answer your first question: here the probabilities do add up to 1.0 for the document (with only two topics, neither falls below the default 0.01 threshold), and that is what get_document_topics does. The documentation clearly states that it returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
In addition, I tried get_term_topics with the keyword "ierr":
t = lda.get_term_topics("ierr", minimum_probability=0.000001)
and the result is [(1, 0.027292299843400435)], which is just each topic's contribution for that one word, which makes sense.
So, you can tag documents based on the topic distribution obtained with get_document_topics, and you can determine the importance of a word from the contributions given by get_term_topics.
I hope this helps.