How can I determine whether IDs are grouped similarly?

Asked: 2019-03-23 20:46:16

Tags: r cluster-analysis similarity

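I applied two different clustering algorithms to my data, and I would like to express how much their results have in common.

The data is organized as

  • "ID" = identifier
  • "Group_1" = result of the first clustering algorithm
  • "Group_2" = result of the second clustering algorithm.

Group_1 is the output of hierarchical clustering, which had its highest CVI at k = 5, while Group_2 is the output of k-means clustering, which had its highest CVI at C = 10.

I want to determine how similar the two results are.

Here is the data whose similarity I am trying to assess: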

structure(list(ID = c(400100L, 400101L, 400106L, 442306L, 443110L, 
443300L, 443301L, 443302L, 443303L, 443304L, 443307L, 443309L, 
443311L, 443312L, 443313L, 443314L, 443316L, 443317L, 443322L, 
443324L, 443328L, 443329L, 443330L, 443331L, 443332L, 443333L, 
443334L, 443339L, 443344L, 443345L, 443351L, 443365L, 443366L, 
443371L, 443378L, 443382L, 443383L, 443388L, 443390L, 443392L, 
443396L, 443398L, 443399L, 443506L, 443507L, 443511L, 443512L, 
443514L, 443521L, 443522L, 443800L, 443802L, 443816L, 443817L, 
443819L, 443820L, 443823L, 443825L, 443828L, 443829L, 443833L, 
443842L, 443855L, 443859L, 443876L, 443877L, 443879L, 444101L, 
444104L, 444202L, 444204L, 444207L, 444251L, 444305L, 444307L, 
444309L, 444312L, 444314L, 444325L, 444327L, 444328L, 444334L, 
444335L, 444339L, 444341L, 444346L, 444359L, 444501L, 444504L, 
444508L, 444509L, 444511L, 444512L, 444514L, 444517L, 444520L, 
444521L, 444547L, 444548L, 444554L, 445101L, 445106L, 445112L, 
445113L, 445115L, 445120L, 445141L, 445302L, 445303L, 445304L, 
445309L, 445312L, 445313L, 445315L, 445316L, 445318L, 445319L, 
445322L, 445327L, 445330L, 445333L, 445404L, 445405L, 445409L, 
445510L, 445522L, 445552L, 445560L, 451704L, 451705L, 452503L, 
452514L), Group_1 = c(1L, 1L, 2L, 2L, 3L, 2L, 4L, 2L, 2L, 1L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 5L, 2L, 2L, 4L, 4L, 4L, 5L, 5L, 
2L, 2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 
3L, 2L, 3L, 2L, 1L, 4L, 2L, 5L, 4L, 5L, 3L, 4L, 1L, 2L, 3L, 2L, 
2L, 5L, 4L, 2L, 2L, 5L, 1L, 1L, 1L, 2L, 5L, 4L, 4L, 2L, 3L, 3L, 
1L, 2L, 1L, 4L, 2L, 4L, 5L, 1L, 4L, 2L, 4L, 2L, 3L, 2L, 2L, 2L, 
1L, 2L, 2L, 3L, 4L, 2L, 2L, 3L, 4L, 1L, 1L, 5L, 2L, 2L, 3L, 4L, 
3L, 5L, 4L, 1L, 1L, 1L, 2L, 4L, 3L, 4L, 4L, 1L, 2L, 1L, 1L, 2L, 
5L, 4L, 4L, 2L, 4L, 3L, 1L, 1L, 3L, 5L), Group_2 = c(7, 7, 7, 
7, 8, 3, 3, 7, 3, 9, 6, 1, 7, 7, 10, 7, 4, 6, 7, 7, 6, 3, 3, 
10, 7, 6, 1, 7, 9, 1, 6, 7, 3, 1, 5, 3, 7, 2, 5, 6, 5, 4, 6, 
10, 1, 1, 1, 10, 1, 6, 7, 6, 6, 3, 7, 7, 6, 5, 7, 6, 9, 7, 8, 
6, 3, 7, 9, 3, 7, 6, 6, 2, 6, 3, 3, 2, 7, 1, 6, 6, 6, 3, 6, 6, 
3, 7, 7, 1, 3, 7, 3, 6, 8, 6, 3, 7, 6, 7, 7, 1, 3, 6, 7, 3, 7, 
3, 7, 3, 3, 5, 5, 2, 6, 3, 1, 6, 7, 6, 7, 5, 2, 7, 6, 5, 7, 1, 
8, 7, 3, 9, 7, 6)), row.names = c(NA, -132L), class = c("data.frame"))

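I would like to know the percent agreement between the two groupings, but I don't know how to calculate it.

Ultimately, I would like to arrive at something like:

IDs grouped alike by "Group_1" and "Group_2", divided by N

My assumption is then that the IDs both algorithms grouped similarly are labelled correctly, and that I can re-run the clustering on the remaining IDs.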

1 Answer:

Answer 0: (score: 0)

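Standard clustering evaluation measures, such as

  • Adjusted Rand Index (ARI)
  • Normalized Mutual Information (NMI)

can be used to evaluate the similarity of two clusterings. Although they are usually used to compare a clustering against a reference labelling, it is easy to see that both are symmetric: swapping the two clusterings gives the same value, so neither has to be treated as the "true" one.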
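As a rough illustration, both measures can be computed in base R from the cross-tabulation of the two label vectors. This is only a sketch: the function names `adjusted_rand_index` and `normalized_mutual_info` and the toy vectors `a` and `b` are made up for this example, and the NMI is normalized by the geometric mean of the two entropies, which is one of several conventions in use.

```r
# Toy label vectors (hypothetical, not the data from the question)
a <- c(1, 1, 2, 2, 3, 3)
b <- c(7, 7, 7, 9, 9, 9)

# Cross-tabulation: how the clusters of one result split across the other
table(a, b)

# Adjusted Rand Index, computed from pair counts in the contingency table.
# Returns NaN in the degenerate case where both partitions are trivial.
adjusted_rand_index <- function(x, y) {
  tab <- unclass(table(x, y))
  n_pairs <- function(m) sum(choose(m, 2))
  sum_cells <- n_pairs(tab)            # pairs placed together by both results
  sum_rows  <- n_pairs(rowSums(tab))   # pairs placed together by x
  sum_cols  <- n_pairs(colSums(tab))   # pairs placed together by y
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Normalized Mutual Information (assumption: geometric-mean normalization)
normalized_mutual_info <- function(x, y) {
  p_xy <- unclass(table(x, y)) / length(x)
  p_x <- rowSums(p_xy)
  p_y <- colSums(p_xy)
  entropy <- function(p) {
    p <- p[p > 0]
    -sum(p * log(p))
  }
  nz <- p_xy > 0                       # skip empty cells to avoid log(0)
  mi <- sum(p_xy[nz] * log(p_xy[nz] / outer(p_x, p_y)[nz]))
  mi / sqrt(entropy(p_x) * entropy(p_y))
}

adjusted_rand_index(a, b)
normalized_mutual_info(a, b)
```

Assuming the data frame from the question is stored in a variable called `df` (a name chosen here), the calls would be `adjusted_rand_index(df$Group_1, df$Group_2)` and `normalized_mutual_info(df$Group_1, df$Group_2)`. Neither measure depends on the label values themselves, so it does not matter that one result uses 5 labels and the other 10. The mclust package also provides `adjustedRandIndex()` if you prefer not to write your own.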