Question

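I am trying to solve a clustering problem in R that involves only binary independent variables. I have only a basic understanding of R. When I run the steps below in R, I see that the silhouette coefficients for the first few iterations fall outside their permissible range. A snapshot of this is attached.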

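Steps followed:

1. Compute a distance matrix containing the Jaccard dissimilarity between every pair of records. Function: vegdist() from package vegan.
2. Use the distance matrix for k-means and run k-means multiple times, for k from 1 to 12. Function: kmeansruns() from package fpc.
3. Capture the average silhouette width (asw) for every iteration and pick the best k, i.e. the one with the maximum silhouette.
4. Cross-validate this 'k' (found in step 3) using only 100 iterations and bootstrap samples to judge the stability of the clusters.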

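I am finding inconsistent average silhouette values, as shown in the plot of k (x-axis) versus asw (y-axis) [k_versus_asw.jpeg].

Can someone help me work out what might be going wrong here? Or should a different clustering algorithm be used?

The code and sample data for this analysis are attached:

Code: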

> ###############################################
> 
> library(vegan)
> library(fpc)
> library(reshape2)
> library(ggplot2)
> 
> # Jaccard dissimilarity between every pair of records
> dist <- vegdist(mydat2, method = "jaccard")
> 
> # Run k-means for k = 1..12 and score each k by average silhouette width
> clustering.asw <- kmeansruns(dist, krange = 1:12, criterion = "asw")
> clustering.asw$bestk
> 
> critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
> 
> critframe <- melt(critframe, id.vars = c("k"),
>                   variable.name = "measure", value.name = "score")
> 
> ggplot(critframe, aes(x = k, y = score, color = measure)) +
>   geom_point(aes(shape = measure)) +
>   geom_line(aes(linetype = measure)) +
>   scale_x_continuous(breaks = 1:12, labels = 1:12)
> 
> summary(clustering.asw)
> 
> kbest.p <- 2
> 
> # Bootstrap the chosen k to judge cluster stability
> cboot <- clusterboot(dist, clustermethod = kmeansCBI, runs = 100,
>                      iter.max = 100, krange = kbest.p, seed = 12345)
> groups <- cboot$result$partition
> 
> print(cboot$result$partition, kbest.p)
> 
> cboot$bootmean
> 
> cboot$bootbrd
> 
> ####################################################

Sample data:

ID V1 V2 V3 V4 V5
 1  0  1  0  1  0
 2  0  1  0  0  1
 3  0  0  0  0  0
 4  1  0  0  1  0
 5  1  0  1  1  0
 6  0  1  0  0  0
 7  0  0  0  0  0
 8  0  0  0  0  1
 9  0  0  1  0  0
10  0  1  0  1  0
11  0  0  0  0  0
12  1  0  0  0  1
13  1  0  0  0  0
14  1  1  0  0  0
15  0  0  0  0  0
16  0  0  0  0  0
17  0  0  0  0  0
18  0  0  1  1  0
19  0  0  0  1  1
20  0  1  0  1  0

There are 40 such binary columns and a little over 350 observations.

Answer 1

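k-means cannot use a distance matrix. It only works with squared Euclidean distance (or with equivalent distances that are Euclidean in some kernel space, where the kernel preserves the mean).

It computes point-to-mean distances, not point-to-point distances. A distance matrix is therefore useless to it.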
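To make the point-to-point versus point-to-mean distinction concrete, here is a minimal sketch of one possible alternative: a medoid-based method (PAM from the cluster package), which needs only pairwise dissimilarities. This is a swapped-in technique, not something the answer itself proposes. The toy data, the variable names, and the use of base R's dist(method = "binary") in place of vegdist() are all assumptions made for illustration.

```r
library(cluster)  # recommended package, normally installed alongside R

set.seed(42)

# Hypothetical stand-in for the real data: 60 rows, 8 binary columns,
# two groups that tend to switch on different halves of the columns.
n_per_group <- 30
group_a <- cbind(matrix(rbinom(n_per_group * 4, 1, 0.8), ncol = 4),
                 matrix(rbinom(n_per_group * 4, 1, 0.1), ncol = 4))
group_b <- cbind(matrix(rbinom(n_per_group * 4, 1, 0.1), ncol = 4),
                 matrix(rbinom(n_per_group * 4, 1, 0.8), ncol = 4))
toy <- rbind(group_a, group_b)

# Base R's "binary" method is the Jaccard distance for 0/1 data
# (used here instead of vegan::vegdist to keep the sketch self-contained).
d <- dist(toy, method = "binary")

# PAM uses actual observations (medoids) as cluster centres, so it only
# needs pairwise dissimilarities and never has to compute a mean.
ks <- 2:6
asw <- sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)

# Unscaled average silhouette width per k, and the k that maximises it
print(data.frame(k = ks, asw = round(asw, 3)))
best_k <- ks[which.max(asw)]
```

Because the silhouette widths are reported unscaled, they can be read directly against the usual [-1, 1] range.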

Nevertheless, the silhouette should lie in [-1, +1], so something is incorrect in the code you are using. Look into the code; don't treat it as a black box.
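For reference, the bound follows from the standard definition of the silhouette. The notation is introduced here and is not from the thread: a(i) is the mean dissimilarity of point i to the other members of its own cluster, and b(i) is the smallest mean dissimilarity of point i to the members of any other cluster.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad -1 \le s(i) \le 1
```

Since a(i) and b(i) are non-negative, |b(i) - a(i)| can never exceed max{a(i), b(i)}, so every s(i), and therefore the average silhouette width, stays inside [-1, 1]. A reported value outside that range means the numbers were transformed somewhere after they were computed.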

Answer 2

The error is in:

asw = scale(clustering.asw$crit)

When you standardize the silhouette values, you remove the [-1, 1] limits and make them hard to interpret.
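A small sketch of the effect, using made-up silhouette values (they are not the asker's results) and base R only:

```r
# Made-up average silhouette widths for k = 1..6 (illustrative values only)
asw_raw <- c(0, 0.62, 0.41, 0.35, 0.30, 0.28)

# scale() centres to mean 0 and divides by the standard deviation,
# so the result is a z-score, no longer a silhouette width.
asw_scaled <- as.numeric(scale(asw_raw))

print(range(asw_raw))     # inside [-1, 1]
print(range(asw_scaled))  # outside [-1, 1] for these values

# Scaling is monotone, so the k with the highest score does not change;
# only the meaning of the y-axis is lost.
same_best <- which.max(asw_raw) == which.max(asw_scaled)
print(same_best)
```

In the question's code, this would presumably mean building the data frame with asw = clustering.asw$crit rather than asw = scale(clustering.asw$crit), so that the plotted values stay on the silhouette scale.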

Average silhouette values for k-means going beyond the permissible range of -1 to +1
