I have an LDA topic model trained with MALLET, and I want to compute the cosine similarity between two documents. However, I'm not sure which of the files MALLET outputs I should compute the cosine on.
My cosine similarity function works fine; I'm just not sure what I should be comparing in MALLET's output.
Any help would be appreciated!
Answer 0 (score: 2)
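Each document is represented by its topic composition, so those compositions are what you have to compare. Use the --output-doc-topics parameter to get the file you need.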
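The rows are the documents, and the columns are the proportions of each topic in that document. In the current version (2.0.8) the columns are sorted by ascending topic ID; in other versions they are sorted by highest probability instead.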
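To make the row/column layout concrete, here is a minimal Python sketch of reading such a file and comparing two documents. It assumes the 2.0.8-style layout: tab-separated fields, with the document index first, the document name second, and then one proportion per topic in topic-ID order. The function names and the sample rows are made up for illustration, so check them against your own output file before relying on this.

```python
import math

def parse_doc_topics(lines):
    """Map document name -> list of topic proportions.

    Assumed layout (hypothetical, verify against your file):
    doc index <TAB> doc name <TAB> proportion of topic 0 <TAB> topic 1 ...
    """
    docs = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any header/comment line
        fields = line.split("\t")
        docs[fields[1]] = [float(x) for x in fields[2:]]
    return docs

def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Made-up rows standing in for the contents of the doc-topics file.
sample = [
    "0\tdoc_a\t0.7\t0.2\t0.1",
    "1\tdoc_b\t0.6\t0.3\t0.1",
    "2\tdoc_c\t0.1\t0.1\t0.8",
]
docs = parse_doc_topics(sample)
sim_ab = cosine_similarity(docs["doc_a"], docs["doc_b"])
sim_ac = cosine_similarity(docs["doc_a"], docs["doc_c"])
print(sim_ab > sim_ac)  # doc_a shares its dominant topic with doc_b, not doc_c
```

With a real file you would pass the open file object instead of `sample`. If your version writes topic/proportion pairs sorted by probability, the vectors would first have to be reordered by topic ID so that position i means the same topic in every row.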
Besides cosine similarity, you should also consider other metrics, e.g. the (symmetric) Kullback-Leibler divergence or the Hellinger distance.
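As a rough sketch of those two alternatives, both can be computed directly on the per-document topic proportions. The function names are hypothetical. The symmetric KL here is taken as the sum of the two directed divergences (some authors use the average instead), and the small `eps` smoothing term is an assumption added only to avoid taking the logarithm of zero.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)

def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p); eps is an assumed smoothing constant."""
    def kl(x, y):
        return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(x, y))
    return kl(p, q) + kl(q, p)

# Made-up topic proportions for two documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.1, 0.8]
print(hellinger_distance(doc_a, doc_b))
print(symmetric_kl(doc_a, doc_b))
```

Unlike cosine similarity, both are distances: 0 means identical topic compositions and larger values mean more dissimilar documents.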