Question

I am topic modeling book titles and subjects from the Harvard Library.

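I use the Gensim Mallet wrapper to model with Mallet's LDA. When I try to get the coherence and perplexity values to see how good the model is, the perplexity cannot be computed and I get the warnings shown below. I do not get the same error if I use Gensim's built-in LDA model instead of Mallet. My corpus holds 7M+ documents of at most 50 words each, 20 on average, so the documents are short.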

Below is the relevant part of my code:

# TOPIC MODELING

import gensim
from gensim.models import CoherenceModel
num_topics = 50

# Build Gensim's LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  
# a measure of how good the model is. lower the better.

Perplexity:  -47.91929228302663

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

Coherence Score:  0.28852857563541856

Gensim's LDA gave the scores without any problem. Now I model the same bag of words with MALLET:

# Building LDA Mallet Model
mallet_path = '~/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, 
corpus=corpus, num_topics=num_topics, id2word=id2word)

# Convert mallet to gensim type
mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=mallet_model, 
texts=data_words_trigrams, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

Coherence Score:  0.5994123896865993

Then I ask for the perplexity value and get the warnings and the NaN value below.

# Compute Perplexity
print('\nPerplexity: ', mallet_model.log_perplexity(corpus))

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1108: RuntimeWarning: invalid value encountered in multiply
  score += np.sum((self.eta - _lambda) * Elogbeta)

Perplexity:  nan

/app/app-py3/lib/python3.5/site-packages/gensim/models/ldamodel.py:1109: RuntimeWarning: invalid value encountered in subtract
  score += np.sum(gammaln(_lambda) - gammaln(self.eta))

I realize that this is a very Gensim-specific question and requires deeper knowledge of this function: gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

So I would appreciate any comments on the warnings and on the Gensim side of things.

Answer 1

My two cents.

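It seems that in lda_model.log_perplexity(corpus) you use the same corpus that you used for training. You might have better luck with a held-out/test set of the corpus.
lda_model.log_perplexity(corpus) does not return the perplexity. It returns the "bound". If you want to turn it into a perplexity, do np.exp2(-bound). I struggled with this for some time :)
There is no way to report perplexity with the Mallet wrapper, afaik.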
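To make the first two points concrete, here is a minimal sketch. It assumes the corpus is an in-memory list of bag-of-words documents; `split_corpus` and `bound_to_perplexity` are hypothetical helper names, not part of Gensim, and the toy corpus is made up for illustration. The conversion is written as `2 ** (-bound)`, which is the same computation as `np.exp2(-bound)`.

```python
import random


def split_corpus(corpus, test_fraction=0.1, seed=100):
    """Split a list of bag-of-words documents into train and held-out lists.

    Hypothetical helper: assumes the whole corpus fits in memory as a list.
    """
    indices = list(range(len(corpus)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(corpus) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [doc for i, doc in enumerate(corpus) if i not in test_idx]
    test = [doc for i, doc in enumerate(corpus) if i in test_idx]
    return train, test


def bound_to_perplexity(per_word_bound):
    """Turn the per-word bound from log_perplexity() into a perplexity."""
    return 2.0 ** (-per_word_bound)


# Toy corpus in Gensim's (token_id, count) format, made up for illustration.
toy_corpus = [
    [(0, 1), (1, 2)],
    [(1, 1), (2, 1)],
    [(0, 2), (3, 1)],
    [(2, 3), (3, 1)],
]
train_corpus, heldout_corpus = split_corpus(toy_corpus, test_fraction=0.25)

# With a real model, the held-out documents would be scored like this:
#   lda_model = gensim.models.ldamodel.LdaModel(corpus=train_corpus, ...)
#   bound = lda_model.log_perplexity(heldout_corpus)
#   print('Perplexity: ', bound_to_perplexity(bound))
```

Training on `train_corpus` and scoring only `heldout_corpus` keeps the reported number from being measured on documents the model has already seen.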

Answer 2

I don't think the perplexity function is implemented for the Mallet wrapper. As mentioned in Radim's answer, the perplexity is displayed on stdout:

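"AFAIR, Mallet displays the perplexity to stdout. Would that be enough for you? Capturing these values programmatically should be possible too, but I haven't looked into that. Hopefully Mallet has some API call for perplexity evaluation too, but it's certainly not included in the wrapper."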

I just ran it on a sample corpus, and the LL/token value was indeed printed every so many iterations:

LL/token: -9.45493

perplexity = 2^(-LL/token) = 701.81
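If you have Mallet's console output saved as text, pulling out those values could look roughly like the sketch below. The helper names and the sample log lines (including the `<10>` iteration prefix) are assumptions for illustration, not something the Gensim wrapper provides. The default base of 2 follows the formula above; if Mallet's log-likelihood turns out to be a natural log, pass `base=math.e` instead.

```python
import re

# Matches the number that follows "LL/token:" in a captured log line.
LL_TOKEN_PATTERN = re.compile(r"LL/token:\s*(-?\d+(?:\.\d+)?)")


def extract_ll_per_token(log_text):
    """Return every LL/token value found in the log text, in order."""
    return [float(value) for value in LL_TOKEN_PATTERN.findall(log_text)]


def perplexity_from_ll(ll_per_token, base=2.0):
    """Convert a per-token log-likelihood into a perplexity."""
    return base ** (-ll_per_token)


# Hypothetical captured output; only the last value comes from the run above.
sample_log = """<10> LL/token: -9.81234
<20> LL/token: -9.45493
"""

values = extract_ll_per_token(sample_log)
print('LL/token values:', values)
print('Perplexity (last value):', perplexity_from_ll(values[-1]))
```

Taking the last value reports the perplexity at the end of training, since earlier iterations are usually worse.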

Gensim topic modeling with Mallet perplexity
