Question

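I am using the NMF and LDA sub-modules of sklearn to analyze unlabeled text. I read the documentation, but I am not sure whether the transform functions in these modules (NMF and LDA) are the same as the posterior function in R's topicmodels (see "Predicting LDA topics for new data"). Basically, I am looking for a function that lets me predict the topics of a test set using a model trained on the training set. I first predicted topics on the entire dataset. Then I split the data into training and test sets, trained a model on the training set, and transformed the test set with that model. I did not expect to get identical results, but comparing the topics from the two runs does not convince me that transform does the same job as the function in R's package. I would appreciate your response.

Thank you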

Answer 1

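Calling transform on a fitted LatentDirichletAllocation model returns the document-topic distribution for new documents. In older scikit-learn releases this distribution is unnormalized (newer releases appear to normalize it already). To get proper probabilities, you can simply normalize the result, which is harmless if the rows already sum to 1. Here is an example: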

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train,test = dataset.data[:100], dataset.data[100:200]

# vectorize the features
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)

# train the model
lda = LatentDirichletAllocation(n_components=5)  # this parameter was called n_topics in older scikit-learn releases
lda.fit(X_train)

# predict topics for test data
X_test = tf_vectorizer.transform(test)
# doc-topic distribution (unnormalized in older scikit-learn releases)
doc_topic_dist_unnormalized = lda.transform(X_test)

# normalize each row so it sums to 1 (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)
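
The row normalization can also be pulled into a small helper that tolerates documents with no topic weight at all. This is only a sketch: `normalize_rows` is a hypothetical name, not part of scikit-learn, and the `raw` matrix is a made-up stand-in for the output of `lda.transform(X_test)`.

```python
import numpy as np

def normalize_rows(weights):
    """Scale each row to sum to 1; all-zero rows are returned as zeros."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    # Replace zero sums with 1 so empty rows do not trigger a division by zero.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return weights / safe_sums

# Made-up stand-in for lda.transform(X_test): 3 documents, 2 topics.
raw = np.array([[2.0, 6.0],
                [1.0, 1.0],
                [0.0, 0.0]])
probs = normalize_rows(raw)
print(probs)
```

Applying the helper to rows that already sum to 1 leaves them unchanged, so it should be safe to call regardless of whether transform normalizes its output.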

To find the top-ranked topic for each document, you can do the following:

doc_topic_dist.argmax(axis=1)
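
The question also mentions NMF. The same pattern of fitting on the training set and calling transform on the test set should carry over; the following is a minimal sketch under stated assumptions, not a definitive recipe. The toy corpus, the choice of two components, and the `init`, `random_state`, and `max_iter` settings are all illustrative choices that do not come from the original answer.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to keep the sketch self-contained.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "my dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "the market rallied on bond news",
]
test_docs = ["a cat and a dog", "bond markets and stocks"]

# Fit the vocabulary on the training documents only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Learn the topic-term matrix from the training set.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
nmf.fit(X_train)

# Project unseen documents onto the learned topics (non-negative weights).
W_test = nmf.transform(X_test)

# Optional: rescale rows to sum to 1, guarding against all-zero rows.
row_sums = W_test.sum(axis=1, keepdims=True)
doc_topic = W_test / np.where(row_sums == 0, 1.0, row_sums)
top_topic = doc_topic.argmax(axis=1)
print(doc_topic)
print(top_topic)
```

Unlike LDA, NMF weights are not probabilities under a generative model, so the rescaled rows are best read as relative topic weights rather than a posterior.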

python - sklearn Latent Dirichlet Allocation Transform v.Fittransform
