Question

I am using the following code to do topic modeling on my documents:

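from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# tokenize: user-defined tokenizer function, not shown here
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3, ngram_range=(1,5))

tfidf = tfidf_vectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; older versions use it instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()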


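from sklearn.decomposition import NMF

no_topics = 50

# %time is an IPython/Jupyter magic; drop it when running as a plain script
%time nmf = NMF(n_components=no_topics, random_state=11, init='nndsvd').fit(tfidf)
# Document-topic weights, shape (n_documents, no_topics)
topic_pr = nmf.transform(tfidf)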

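I thought topic_pr gives me the probability distribution over the topics for each document. In other words, I expected the numbers in the output (topic_pr) to be the probabilities that the document in row X belongs to each of the 50 topics in the model. However, the numbers do not add up to 1. Are these really probabilities? If not, is there a way to convert them to probabilities?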

Thanks

Answer 1

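NMF returns a non-negative factorization, which has nothing to do with probabilities (to the best of my knowledge). If you just want probabilities, you can transform the output of NMF by L1-normalizing each row: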

probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)

This assumes that topic_pr is a non-negative matrix, which is true in your case.
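One caveat with the one-line normalization: a document whose NMF weights are all zero has a row sum of 0, and the division then yields NaN for that row. Below is a minimal sketch that leaves such rows as zeros instead. The helper name rows_to_distributions and the toy array are hypothetical stand-ins, not part of the original code.

```python
import numpy as np

def rows_to_distributions(weights):
    """L1-normalize each row; all-zero rows stay zero instead of becoming NaN."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    # `where` skips rows whose sum is 0; `out` supplies the value kept for them
    return np.divide(weights, row_sums,
                     out=np.zeros_like(weights), where=row_sums > 0)

# Toy stand-in for topic_pr: 3 documents, 2 topics (assumed values)
toy_topic_pr = np.array([[0.2, 0.6],
                         [0.0, 0.0],
                         [1.5, 0.5]])
probs = rows_to_distributions(toy_topic_pr)
```

With the real data, the call would be rows_to_distributions(topic_pr). Whether an all-zero row should stay zero or become a uniform distribution is a modeling choice; this sketch picks the former.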

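Edit: Apparently there is a probabilistic version of NMF.

Quoting sklearn's documentation:

Non-negative Matrix Factorization is applied with two different objective functions: the Frobenius norm, and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing.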

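The same page also fits Latent Dirichlet Allocation, a separate probabilistic topic model. Its fit_transform returns per-document topic distributions whose rows sum to 1, which is what you seem to need:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5)
topic_pr = lda.fit_transform(tfidf)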
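For the Kullback-Leibler variant of NMF mentioned in the quote, a rough sketch might look like the following. It assumes a small made-up corpus (toy_docs) in place of the real documents and two topics instead of 50. In scikit-learn, beta_loss='kullback-leibler' is only supported by the multiplicative-update solver (solver='mu'), and init='nndsvda' is used here because that solver cannot update the exact zeros produced by 'nndsvd'. The returned weights are still unnormalized, so the rows are rescaled afterwards.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the asker's docs
toy_docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# KL objective requires the multiplicative-update solver
kl_nmf = NMF(n_components=2, beta_loss="kullback-leibler", solver="mu",
             init="nndsvda", max_iter=500, random_state=11)
W = kl_nmf.fit_transform(toy_tfidf)  # shape (n_documents, n_topics)

# Rescale each row to sum to 1; all-zero rows (if any) stay zero
row_sums = W.sum(axis=1, keepdims=True)
kl_probs = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
```

The normalized rows can be read as topic shares per document, but they are a rescaling of the factorization rather than posterior probabilities estimated by the model.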

Topic probability distribution using NMF
