How to interpret the sklearn LDA perplexity score. Why does it always increase as the number of topics increases?

Asked: 2017-08-13 07:08:35

Tags: python scikit-learn topic-modeling perplexity

I am trying to find the optimal number of topics with sklearn's LDA model. To do that, I compute the perplexity, adapting the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2.

However, the perplexity always increases, unreasonably, as I increase the number of topics. Is something wrong in my implementation, or are these values actually correct?
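(For reference, perplexity is exp(-log-likelihood per word), so lower values mean a better model; that is why a score that grows monotonically with the number of topics looks wrong to me.)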

Use tf (raw term count) features for LDA:

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
n_samples = 0.7  # fraction of documents used for training
n_features = 1000
n_top_words = 20
dataset = kickstarter['short_desc'].tolist()  # kickstarter is a pandas DataFrame defined elsewhere
data_samples = dataset[:int(len(dataset)*n_samples)]
test_samples = dataset[int(len(dataset)*n_samples):]

Compute the perplexity for 5, 10, 15, ..., 100 topics:
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Transform the held-out test samples with the same vectorizer.
print("Transforming test samples to tf features...")
t0 = time()
tf_test = tf_vectorizer.transform(test_samples)
print("done in %0.3fs." % (time() - t0))

Computing the perplexity results:

for i in range(5, 101, 5):
    n_topics = i

    print("Fitting LDA models with tf features, "
          "n_samples=%d, n_features=%d n_topics=%d "
          % (n_samples, n_features, n_topics))

    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    t0 = time()
    lda.fit(tf)

    train_gamma = lda.transform(tf)
    train_perplexity = lda.perplexity(tf, train_gamma)

    test_gamma = lda.transform(tf_test)
    test_perplexity = lda.perplexity(tf_test, test_gamma)

    print('sklearn perplexity: train=%.3f, test=%.3f' %
          (train_perplexity, test_perplexity))

    print("done in %0.3fs." % (time() - t0))

1 Answer:

Answer 0 (score: 4):

There is a bug in scikit-learn that causes the perplexity to increase:

https://github.com/scikit-learn/scikit-learn/issues/6777
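
A minimal sketch of a workaround, assuming scikit-learn >= 0.19 (where the fix for that issue landed, n_topics was renamed to n_components, and the doc_topic_distr argument of perplexity() was deprecated): call perplexity() on the raw counts alone and let it derive the document-topic distribution internally, instead of feeding it the output of transform():

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

# No doc_topic_distr argument: perplexity() recomputes the
# document-topic distribution itself, avoiding the mismatch between
# transform()'s normalized output and the unnormalized gamma that the
# old code path expected.
train_perplexity = lda.perplexity(tf)
test_perplexity = lda.perplexity(tf_test)

Even with the fix, held-out perplexity does not always reach a clean minimum at the "true" number of topics, so treat it as one signal among several when choosing n_topics.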