Question

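I have trained an LDA model on a corpus, and I would like to get the corresponding topic for each sentence so that I can compare what the algorithm found with the labels I have.

I tried the code below, but the results are very poor: I get topic 17 far too often (roughly 25% of the volume, when it should be closer to 5%).

Thanks for your help.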

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# texts_lemmatized: list of lemmatized texts (Dictionary expects each text as a list of tokens)
dico = Dictionary(texts_lemmatized)
corpus_lda = [dico.doc2bow(text) for text in texts_lemmatized]

lda_ = LdaModel(corpus_lda, num_topics=18)

data = []

# theme_commentaire: the human-assigned label of each sentence
for i in range(len(theme_commentaire)):
    # lda_.get_document_topics() gives the topic distribution for a specific sentence
    # as a list of (topic_id, probability) tuples
    algo = max(lda_.get_document_topics(corpus_lda[i]))[0]
    human = theme_commentaire[i]
    data.append([str(algo), human])

cols = ['algo', 'human']
df_ = pd.DataFrame(data, columns=cols)
df_.head()

Answer 1

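Solved in the comments:

I found my problem: it was the max() function, which operates on the keys of my list of tuples [(num_topics, probability)], so most of the time I get 17 because it is the largest key. - glouis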
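
To make the comment above concrete, here is a minimal sketch of one possible fix. It assumes the input has the shape that get_document_topics() returns for a single document, a list of (topic_id, probability) tuples. The helper name dominant_topic and the example distribution are hypothetical and only serve as illustration. Python compares tuples element by element, so a plain max() picks the largest topic id; passing a key makes it compare by probability instead.

```python
from operator import itemgetter


def dominant_topic(doc_topics):
    """Return the topic id with the highest probability.

    doc_topics: list of (topic_id, probability) tuples, the shape
    returned by LdaModel.get_document_topics() for one document.
    """
    if not doc_topics:
        # The list can be empty when no topic passes the probability threshold.
        return None
    # key=itemgetter(1) makes max() compare the probabilities, not the ids.
    return max(doc_topics, key=itemgetter(1))[0]


# Made-up distribution, for illustration only.
example = [(2, 0.61), (9, 0.30), (17, 0.09)]

print(max(example)[0])          # 17: tuples are compared by topic id first
print(dominant_topic(example))  # 2: the most probable topic
```

In the loop from the question, the assignment would then presumably become algo = dominant_topic(lda_.get_document_topics(corpus_lda[i])).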
