
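sklearn's LatentDirichletAllocation (LDA) produces a weight for every feature/word in each topic, and it is easy to get the top n words per topic from the components_ attribute. Now I want to retrieve the documents that best fit these "pruned" topics, i.e. the documents that contain most of a topic's top n words. Is there a simple, direct way to do this with sklearn, without implementing a new "similarity" method?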

Taking a document's dominant topic does not seem adequate at all: I get documents that contain none of the top n words of their "dominant" topic.

Here is roughly what I do at the moment, but it does not work for these pruned topics:

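import numpy as np
import pandas as pd

# model is the fitted LatentDirichletAllocation; data_vectorized is the document-term matrix
lda_output = model.transform(data_vectorized)
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
dominant_topic = np.argmax(df_document_topic.values, axis=1)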
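
For comparison, here is a minimal sketch of what "fits the pruned topic" could mean in code: for each topic, take the column indices of its n highest-weighted words and measure what fraction of them occur in each document. This is an assumption about the desired score, not a built-in sklearn method; the function name top_word_coverage and the toy arrays are made up for illustration, and only NumPy indexing is used on top of the arrays sklearn already returns.

```python
import numpy as np

def top_word_coverage(components, doc_term, n_top=10):
    """Fraction of each topic's top-n words that occur in each document.

    components: array of shape (n_topics, n_features), e.g. model.components_
    doc_term:   document-term matrix of shape (n_docs, n_features),
                either a NumPy array or a scipy sparse matrix
    Returns an array of shape (n_docs, n_topics) with values in [0, 1].
    """
    components = np.asarray(components)
    n_topics = components.shape[0]
    coverage = np.zeros((doc_term.shape[0], n_topics))
    for k in range(n_topics):
        # Column indices of the n_top highest-weighted words for topic k.
        top_idx = np.argsort(components[k])[::-1][:n_top]
        sub = doc_term[:, top_idx]
        # Densify only the selected columns if the matrix is sparse.
        sub = sub.toarray() if hasattr(sub, "toarray") else np.asarray(sub)
        # Binary presence: a word counts once, however often it occurs.
        coverage[:, k] = (sub > 0).mean(axis=1)
    return coverage

# Toy data (hypothetical): 2 topics, 4 vocabulary terms, 3 documents.
toy_components = np.array([[5.0, 4.0, 0.1, 0.2],
                           [0.1, 0.2, 6.0, 3.0]])
toy_docs = np.array([[2, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 3, 0, 4]])
toy_coverage = top_word_coverage(toy_components, toy_docs, n_top=2)

# Row i of each column is the index of the i-th best-fitting document for that topic.
best_docs_per_topic = np.argsort(-toy_coverage, axis=0, kind="stable")
```

With the names from the snippet above, the call would presumably be top_word_coverage(model.components_, data_vectorized, n_top=10). The score deliberately ignores word counts and the LDA document-topic weights; it could be combined with lda_output (for example by keeping only documents whose coverage exceeds some threshold) if both signals matter.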

Prune the top n features of an LDA topic model and retrieve the fitting documents

0 answers: