Question

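I am trying to find the optimal number of topics for an LDA model in Gensim. One method I found is to compute the log-likelihood of each model and compare the models against each other, e.g. in The input parameters for using latent Dirichlet allocation

So I looked into computing the log-likelihood of an LDA model with Gensim and came across the following post: How do you estimate α parameter of a latent dirichlet allocation model?

It basically says that the update_alpha() method implements the method described in Huang, Jonathan. Maximum Likelihood Estimation of Dirichlet Distribution Parameters. I still don't know how to obtain this parameter using the library without changing the code.

How can I get the log-likelihood from an LDA model with Gensim?

Is there a better way to obtain the optimal number of topics with Gensim?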

Answer 1

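The general rule of thumb is to create LDA models across different topic numbers, and then check the Jaccard similarity and coherence for each. Coherence in this case scores a single topic by the degree of semantic similarity between its high-scoring words (do these words co-occur across the text corpus?). The following will give a strong intuition for the optimal number of topics. This should be a baseline before jumping to the hierarchical Dirichlet process, as that technique has been found to have issues in practical applications.

Start by creating dictionaries of models and topic words for the various topic numbers you want to consider, where in this case corpus is the cleaned tokens, num_topics is a list of the topic numbers you want to consider, and num_keywords is the number of top words per topic that you want considered for the metrics: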

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from gensim.models import LdaModel, CoherenceModel
from gensim import corpora

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]

# Build models for 1-15 topics; the last one has no successor to compare
# against, so the metrics below cover 1-14 topics
num_topics = list(range(1, 16))
num_keywords = 15

LDA_models = {}
LDA_topics = {}
for i in num_topics:
    LDA_models[i] = LdaModel(corpus=bow_corpus,
                             id2word=dirichlet_dict,
                             num_topics=i,
                             update_every=1,
                             chunksize=len(bow_corpus),
                             passes=20,
                             alpha='auto',
                             random_state=42)

    shown_topics = LDA_models[i].show_topics(num_topics=i, 
                                             num_words=num_keywords,
                                             formatted=False)
    LDA_topics[i] = [[word[0] for word in topic[1]] for topic in shown_topics]

Now create a function to derive the Jaccard similarity of two topics:

def jaccard_similarity(topic_1, topic_2):
    """
    Derives the Jaccard similarity of two topics

    Jaccard similarity:
    - A statistic used for comparing the similarity and diversity of sample sets
    - J(A,B) = |A ∩ B| / |A ∪ B|
    - Goal is low Jaccard scores for coverage of the diverse elements
    """
    intersection = set(topic_1).intersection(set(topic_2))
    union = set(topic_1).union(set(topic_2))
                    
    return float(len(intersection))/float(len(union))
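
As a quick sanity check of the idea, here is a minimal self-contained sketch on three made-up topics. The helper name `jaccard` and the word lists are hypothetical and only for illustration; it also guards against two empty topics, which the function above does not.

```python
def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    # Two empty topics have no defined overlap; treat that case as 0.0 here.
    return len(set_a & set_b) / len(union) if union else 0.0


# Made-up top words for three topics.
topic_a = ["price", "market", "stock", "trade"]
topic_b = ["market", "stock", "bank", "loan"]
topic_c = ["goal", "match", "team", "league"]

print(jaccard(topic_a, topic_b))  # 2 shared words out of 6 distinct words
print(jaccard(topic_a, topic_c))  # no shared words, so 0.0
```

Low values between topics of neighbouring models mean the topics cover different vocabulary, which is what the stability step relies on.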

Use the above to derive the mean stability across topics by comparing each model with the one for the next topic number:

LDA_stability = {}
for i in range(len(num_topics) - 1):
    # Pairwise Jaccard similarity between the topics of this model and the next
    jaccard_sims = []
    for topic1 in LDA_topics[num_topics[i]]:
        sims = [jaccard_similarity(topic1, topic2)
                for topic2 in LDA_topics[num_topics[i + 1]]]
        jaccard_sims.append(sims)

    LDA_stability[num_topics[i]] = jaccard_sims

mean_stabilities = [np.array(LDA_stability[i]).mean() for i in num_topics[:-1]]

gensim has a built-in model for topic coherence (this uses the 'c_v' option):

coherences = [CoherenceModel(model=LDA_models[i], texts=corpus,
                             dictionary=dirichlet_dict, coherence='c_v').get_coherence()
              for i in num_topics[:-1]]

From here, derive the ideal number of topics roughly through the difference between the coherence and stability per number of topics:

# Both lists cover num_topics[:-1], so pair them directly instead of indexing by num_keywords
coh_sta_diffs = [coh - sta for coh, sta in zip(coherences, mean_stabilities)]
coh_sta_max = max(coh_sta_diffs)
coh_sta_max_idxs = [i for i, j in enumerate(coh_sta_diffs) if j == coh_sta_max]
ideal_topic_num_index = coh_sta_max_idxs[0] # choose fewer topics in case there's more than one max
ideal_topic_num = num_topics[ideal_topic_num_index]
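
To make the selection rule concrete, here is a small worked sketch with made-up metric values. The numbers and the variable names `topic_counts`, `toy_coherences` and `toy_overlaps` are hypothetical and not results from any real corpus.

```python
# Hypothetical metrics for models with 1..5 topics (made up for illustration).
topic_counts = [1, 2, 3, 4, 5]
toy_coherences = [0.30, 0.42, 0.55, 0.53, 0.50]
toy_overlaps = [0.20, 0.18, 0.10, 0.12, 0.15]

# Coherence minus overlap: high when topics are coherent and overlap little.
diffs = [c - o for c, o in zip(toy_coherences, toy_overlaps)]

# list.index returns the first maximum, so ties resolve to fewer topics.
best_idx = diffs.index(max(diffs))
best_topic_count = topic_counts[best_idx]
print(best_topic_count)  # prints 3
```

The difference is only a rough heuristic: it weights the two metrics equally even though they are on different scales, so it is worth looking at the plot as well.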

Finally, graph these metrics across the topic numbers:

plt.figure(figsize=(20,10))
ax = sns.lineplot(x=num_topics[:-1], y=mean_stabilities, label='Average Topic Overlap')
ax = sns.lineplot(x=num_topics[:-1], y=coherences, label='Topic Coherence')

ax.axvline(x=ideal_topic_num, label='Ideal Number of Topics', color='black')
ax.axvspan(xmin=ideal_topic_num - 1, xmax=ideal_topic_num + 1, alpha=0.5, facecolor='grey')

# Leave 10% headroom above the highest metric value
y_max = 1.10 * max(max(mean_stabilities), max(coherences))
ax.set_ylim([0, y_max])
ax.set_xlim([1, num_topics[-1]-1])
                
ax.axes.set_title('Model Metrics per Number of Topics', fontsize=25)
ax.set_ylabel('Metric Level', fontsize=20)
ax.set_xlabel('Number of Topics', fontsize=20)
plt.legend(fontsize=20)
plt.show()

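Your ideal number of topics will maximize coherence and minimize the topic overlap based on Jaccard similarity. In this case it looks like we'd be safe choosing a topic number around 14.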

Answer 2

Although I cannot comment on Gensim in particular, I can weigh in with some general advice for optimising your topics.

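As you said, using log-likelihood is one method. Another option is to hold a set of documents out of the model generation process, infer topics over them once the model is complete, and check whether the result makes sense.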

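A completely different technique you could try is a hierarchical Dirichlet process. This method can find the number of topics in the corpus dynamically, without it being specified.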

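There are many papers on how best to specify parameters and evaluate a topic model; depending on your experience level, these may or may not be helpful:

Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.

Evaluation Methods for Topic Models, Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D.

Also, here is the paper on the hierarchical Dirichlet process:

Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.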

What is the best way to obtain the optimal number of topics for an LDA model using Gensim?
