I am trying to find the optimal number of topics with sklearn's LDA model. To do so, I compute perplexity by adapting the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2.
But the perplexity always increases unreasonably as I increase the number of topics. Is something wrong with my implementation, or is it actually giving the correct values?
from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
n_samples = 0.7
n_features = 1000
n_top_words = 20
dataset = kickstarter['short_desc'].tolist()
data_samples = dataset[:int(len(dataset)*n_samples)]
test_samples = dataset[int(len(dataset)*n_samples):]
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for the test set...")
t0 = time()
tf_test = tf_vectorizer.transform(test_samples)
print("done in %0.3fs." % (time() - t0))
for n_topics in range(5, 101, 5):
    print("Fitting LDA models with tf features, "
          "n_samples=%d, n_features=%d, n_topics=%d"
          % (len(data_samples), n_features, n_topics))
    lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    t0 = time()
    lda.fit(tf)
    train_gamma = lda.transform(tf)
    train_perplexity = lda.perplexity(tf, train_gamma)
    test_gamma = lda.transform(tf_test)
    test_perplexity = lda.perplexity(tf_test, test_gamma)
    print('sklearn perplexity: train=%.3f, test=%.3f' %
          (train_perplexity, test_perplexity))
    print("done in %0.3fs." % (time() - t0))
Answer (score: 4)
There is a bug in scikit-learn that causes the perplexity to increase:
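In recent scikit-learn releases the `doc_topic_distr` argument to `perplexity` has been removed and the doc-topic distribution is computed internally, which sidesteps the inconsistency in the old two-argument call. A minimal sketch of the same loop against the current API (note that `n_topics` was renamed to `n_components`; the toy corpus here is a hypothetical stand-in for your `kickstarter['short_desc']` data) might look like:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for kickstarter['short_desc']; swap in your own data.
docs = ["cats and dogs", "dogs chase cats", "stocks rise today",
        "markets fall fast", "cats sleep all day", "traders buy stocks"]

tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(docs)

for n_components in (2, 3):
    lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50.,
                                    random_state=0)
    lda.fit(tf)
    # perplexity() now infers the doc-topic distribution itself instead of
    # taking a (normalized) gamma matrix as a second argument.
    print('n_components=%d perplexity=%.3f'
          % (n_components, lda.perplexity(tf)))
```

If upgrading is an option, re-running your sweep on a fixed version should give perplexity values that can be compared meaningfully across topic counts.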