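I am using this python implementation of lda for topic modeling with latent Dirichlet allocation. I create k-1 model instances, one for each number of topics from 2 to k. I am trying to determine the optimal number of topics from the mean of each model's log-likelihood series. My code looks like this: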
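from statistics import mean  # assumed source of mean(); the original does not show the import

def optimal(models):
    # One log-likelihood series per fitted model
    ll = [m.loglikelihoods_ for m in models]
    means = [mean(vals) for vals in ll]
    print(means)
    # Select the model with the smallest mean log-likelihood
    val, idx = min([(val, idx) for (idx, val) in enumerate(means)])
    # models[0] has 2 topics, so index idx corresponds to idx + 2 topics
    print('\n*OPTIMAL:{} topics, likelihood:{}'.format(idx+2, val))
    return models[idx]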
The result is:
[-116437.19151950255, -116432.23207017583, -117125.84739129762, -115787.39060737971, -116028.07281838865, -116343.8361756514, -116698.45128717832, -116924.95163260077, -117215.84926933098]
*OPTIMAL:10 topics, likelihood:-117215.84926933098
Is this the correct approach, or do I have to maximize instead of minimize? In that case the code would read:
max([(val, idx) for (idx, val) in enumerate(means)])
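
For comparison, here is a minimal sketch of what each variant selects from the means printed above. The helper name `pick` and its `first_k` parameter are hypothetical and only serve the illustration. The sketch assumes, as in the code above, that the first model has 2 topics.

```python
# Mean log-likelihoods copied from the output above (models with 2..10 topics).
means = [-116437.19151950255, -116432.23207017583, -117125.84739129762,
         -115787.39060737971, -116028.07281838865, -116343.8361756514,
         -116698.45128717832, -116924.95163260077, -117215.84926933098]


def pick(means, chooser, first_k=2):
    """Return (n_topics, mean_loglikelihood) chosen by `chooser` (min or max).

    `first_k` is the topic count of the first model (assumed to be 2 here).
    """
    val, idx = chooser((val, idx) for idx, val in enumerate(means))
    return idx + first_k, val


print('min ->', pick(means, min))  # least likely model on average
print('max ->', pick(means, max))  # most likely model on average
```

On these numbers, `min` selects the 10-topic model and `max` selects the 5-topic model, whose mean log-likelihood is the least negative.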