How to get the cosine similarity between two documents in MALLET?

Time: 2017-04-06 17:09:02

Tags: java modeling lda mallet

I have an LDA topic model trained with MALLET, and I want to compute the cosine similarity between two documents. However, I'm not sure which of the files that MALLET outputs I should compute the cosine on.

My cosine similarity function works fine; I'm just not sure what I should be comparing in MALLET's output.

Any help would be appreciated!

1 answer:

Answer 0 (score: 2)

Each document is represented by its topic composition, so that is what you have to compare. Use the --output-doc-topics parameter to get the file you need.

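Rows are documents, and columns are the proportions of each topic in that document. In the current version (2.0.8), the columns are sorted in ascending order by topic ID; in other versions they are sorted by highest probability instead.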
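As a rough illustration of comparing two rows of that file, here is a minimal Java sketch. It assumes the 2.0.8 layout with tab-separated fields, where the first two fields are a document index and a document name and the remaining fields are the topic proportions in topic-ID order. The class name `DocTopicSimilarity`, the helper `parseLine`, and the sample lines are all made up for this example and are not part of MALLET.

```java
import java.util.Arrays;

class DocTopicSimilarity {

    // Assumed layout of one doc-topics line (tab-separated):
    // <doc index> TAB <doc name> TAB <topic 0 proportion> TAB <topic 1 proportion> ...
    static double[] parseLine(String line) {
        String[] fields = line.trim().split("\t");
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                "Expected index, name and at least one proportion: " + line);
        }
        // Skip the first two fields (index and name); keep the proportions.
        double[] proportions = new double[fields.length - 2];
        for (int i = 2; i < fields.length; i++) {
            proportions[i - 2] = Double.parseDouble(fields[i]);
        }
        return proportions;
    }

    // Cosine similarity between two topic-proportion vectors of equal length.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // convention chosen here for an all-zero vector
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not real MALLET output.
        double[] docA = parseLine("0\tdoc-a.txt\t0.7\t0.2\t0.1");
        double[] docB = parseLine("1\tdoc-b.txt\t0.1\t0.2\t0.7");
        System.out.println(Arrays.toString(docA));
        System.out.println("cosine(docA, docB) = " + cosine(docA, docB));
    }
}
```

For a real file, the lines could be read with `java.nio.file.Files.readAllLines` and passed to `parseLine` one by one, skipping any header or comment lines if your version writes them. If your MALLET version writes topic/proportion pairs sorted by probability rather than one column per topic, the values would first have to be reordered by topic ID so that both vectors line up.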

Besides cosine similarity, you should also consider other metrics, e.g. the (symmetric) Kullback-Leibler divergence or the Hellinger distance.
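A minimal Java sketch of those two alternatives follows, assuming each document is already available as an array of topic proportions that sums to 1. The class name `TopicDistances` and the `EPSILON` clamp are choices made for this example: the clamp is one simple way to avoid taking the logarithm of zero, and "symmetric" is taken here to mean KL(p||q) + KL(q||p), which is only one of several common symmetrizations.

```java
class TopicDistances {

    // Assumed floor for proportions, so that log(0) and division by zero cannot occur.
    static final double EPSILON = 1e-12;

    private static void checkSameLength(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("Distributions must have the same length");
        }
    }

    // Hellinger distance: sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), in [0, 1].
    static double hellinger(double[] p, double[] q) {
        checkSameLength(p, q);
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = Math.sqrt(p[i]) - Math.sqrt(q[i]);
            sum += diff * diff;
        }
        return Math.sqrt(0.5 * sum);
    }

    // Symmetric KL divergence: KL(p||q) + KL(q||p) = sum_i (p_i - q_i) * ln(p_i / q_i).
    static double symmetricKl(double[] p, double[] q) {
        checkSameLength(p, q);
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double pi = Math.max(p[i], EPSILON);
            double qi = Math.max(q[i], EPSILON);
            sum += (pi - qi) * Math.log(pi / qi);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical topic proportions for two documents.
        double[] docA = {0.7, 0.2, 0.1};
        double[] docB = {0.1, 0.2, 0.7};
        System.out.println("Hellinger    = " + hellinger(docA, docB));
        System.out.println("Symmetric KL = " + symmetricKl(docA, docB));
    }
}
```

Both are distances rather than similarities, so 0 means identical topic compositions and larger values mean more dissimilar documents, the opposite direction from cosine similarity.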