Euclidean distance or cosine similarity?

Asked: 2012-08-25 09:27:54

Tags: search-engine cluster-analysis information-retrieval euclidean-distance cosine-similarity

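I was reading about Similarity Measure when suddenly my whole world fell apart. I have implemented a search engine using clustering techniques. For clustering I use K-Means with Euclidean distance as the distance measure, and I use cosine similarity to display the results. I was getting amazingly accurate results. But now that I have read this, I realize that what I did was normalize the document vectors and then compute the Euclidean distance between two vectors, so I am not taking magnitude into account anywhere.

Am I doing something wrong?

That said, I think a higher term frequency would lead to a higher tf-idf value and a higher normalized tf-idf value, so such documents can still be ranked appropriately high. Thanks.

Results (using non-normalized vectors; the numbers are Euclidean distances)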

61.79689257425985 222Proposed Research Details.doc
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc
65.51734241367222 done_RPFIII_dr.dogra.doc
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc
418.8868087170988 P. VIJAYA KUMAR (DSS).doc
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc
72.95414421468679 RPF-III__Indo-US_project.doc
82.25126123574397 220Introduction and objectives.doc

Results (using normalized vectors; the numbers are Euclidean distances)

1.3435369899385359 222Proposed Research Details.doc
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc
1.2902191708899362 done_RPFIII_dr.dogra.doc
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc
1.29150899172647 RPF-III__Indo-US_project.doc
1.318016051789028 220Introduction and objectives.doc

Results (the numbers are cosine similarities)

0.09745417833344654 222Proposed Research Details.doc
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc
0.16766724553404047 done_RPFIII_dr.dogra.doc
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc
0.16600226214483405 RPF-III__Indo-US_project.doc
0.13141684361322944 220Introduction and objectives.doc

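Results 1 and 2 are inconsistent with each other, whereas results 2 and 3 agree strongly: the more similar a document is, the smaller its distance. Each distance is taken between the cluster centroid vector and the document vector of each document.

In fact, the strangest result is the document with a Euclidean distance of 418 and the highest similarity, 0.87. Its normalized distance becomes 0.49, which is consistent with the similarity.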
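The agreement between the second and third lists can be checked numerically. For unit-length vectors, Euclidean distance d and cosine similarity c should satisfy d = sqrt(2 - 2c). The following is a minimal sketch that plugs in four of the value pairs posted above; the helper name `distance_from_cosine` is made up for this illustration and is not part of the original code.

```python
import math

# (cosine similarity, Euclidean distance on normalized vectors),
# copied from the result lists above for four of the documents
pairs = [
    (0.09745417833344654, 1.3435369899385359),  # 222Proposed Research Details.doc
    (0.36409322938119104, 1.1277471087250086),  # and_Integrated_Assessment_...doc
    (0.8787591527140649, 0.4924243033927417),   # P. VIJAYA KUMAR (DSS).doc
    (0.3100342067353838, 1.1747048933792805),   # RPF - I PIMS-ICAR ...doc
]


def distance_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(2.0 - 2.0 * cos_sim)


for cos_sim, reported in pairs:
    predicted = distance_from_cosine(cos_sim)
    print(f"cos={cos_sim:.6f}  reported d={reported:.6f}  predicted d={predicted:.6f}")

# Ranking check: most similar first should equal nearest first.
by_similarity = sorted(pairs, key=lambda p: p[0], reverse=True)
by_distance = sorted(pairs, key=lambda p: p[1])
print(by_similarity == by_distance)
```

If the predicted and reported distances match, the normalized Euclidean ranking carries exactly the same information as the cosine ranking, only in the opposite direction.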

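1 answer:

Answer 0: (score: 0)

As I remember from my information retrieval lectures, once both vectors are normalized, Euclidean distance and cosine similarity produce the same ranking in reverse order: the smallest distance corresponds to the largest similarity.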
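A short derivation may make this concrete. The symbols a, b and θ are introduced here for illustration only: a and b are two document vectors already scaled to unit length, and θ is the angle between them.

```latex
\begin{aligned}
\lVert a - b \rVert^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
                      &= 1 + 1 - 2\cos\theta \\
                      &= 2\,(1 - \cos\theta)
\end{aligned}
\qquad\Longrightarrow\qquad
\lVert a - b \rVert = \sqrt{2 - 2\cos\theta}
```

Since sqrt(2 - 2 cos θ) decreases strictly as cos θ increases, sorting by ascending Euclidean distance on unit vectors gives the same order as sorting by descending cosine similarity. The derivation only holds when both vectors have unit length, which would explain why the non-normalized distances in the question do not follow the same order.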