Question

I have the following code, with which I can plot the WSS curve to find the knee, so that I can choose the value of K for KMeans clustering.

# To find WSS
findWSS <- function(data) {
    if (VERBOSE) {
        print(paste("[TRACER] Finding WSS.."))
    }
    start <- Sys.time()
    # WSS for a single cluster is the total sum of squares
    wss <- (nrow(data)-1)*sum(apply(data,2,var))

    # WSS for 2 or more clusters
    for (i in 2:length(unique(data))) {
        wss[i] <- sum(kmeans(data, centers=i)$withinss)
    }
    if (ENABLE_PLOTS) {
        plot(1:length(unique(data)), wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
    }
    end <- Sys.time()
    if (ENABLE_MEASUREMENTS && VERBOSE) {
        print(paste("[TIMER] Finding WSS:", difftime(end, start), "secs"))
    }
}

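Here is a representation of the plot I get:

The knee observed in the plot above is, for example, at 3. But I would like to compute this knee programmatically in R.

Any ideas on how I can do this?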

Answer 1

I used clusGap from the 'cluster' library to solve this problem. Here is the code I used:

library(cluster)  # provides clusGap() and maxSE()

# Compute Gap statistic (http://web.stanford.edu/~hastie/Papers/gap.pdf) to identify number of clusters
# Note: This method is slow due to bootstrapping
computeGapStatistic <- function(data, KMax) {
    # gap <- clusGap((data), FUN = kmeans, K.max = 8, B = 3)
    gap <- clusGap((data), FUN = kmeans, K.max = KMax, B = 3)
    if (ENABLE_PLOTS) {
        plot(gap, main = "Gap statistic for the Nursing shift data")
    }
    # Choose the number of clusters from the gap values and their standard errors
    clusterCount <- with(gap,maxSE(Tab[,"gap"],Tab[,"SE.sim"]))
    if (VERBOSE) {
        print(paste("gap statistics: ", gap[[1]]))
        print(paste("K: ", clusterCount))
    }
    return(clusterCount)
}

Answer 2

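What I have done (less rigorous, but perhaps more intuitive) is to take the differences in WSS (i.e. the y-differences between consecutive points in your plot) and cluster those differences into 2 groups. This is based on the idea that large changes are meaningful and small ones are not, and it tries to separate these "large" and "small" clusters.

I then pick the number of clusters based on the first difference in the "small" group, i.e. if the 'diff(k = 3, k = 4)' value is the first 'small' value, then the correct number of clusters is 3, since no more meaningful structure is found when more than 3 clusters are added.

This will miss finer-grained structure if the clusters are hierarchical (since it stops too early), but I have found it to be a decent starting point.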
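The steps above could be sketched in R roughly as follows. This is only an illustration, not the exact code I used: the function name estimateKnee and the sample wss vector are made up, the function assumes wss[k] holds the WSS for k clusters (as in the question's loop), and it needs at least two distinct differences to work. The 2-group split is done with kmeans on the one-dimensional differences, seeded at the smallest and largest difference so the result does not depend on random starts.

```r
# Hypothetical helper: estimate the knee from a vector of WSS values,
# where wss[k] is the within-groups sum of squares for k clusters.
estimateKnee <- function(wss) {
    # drops[k] is the decrease in WSS when going from k to k + 1 clusters
    drops <- -diff(wss)

    # Split the drops into 2 groups ("small" and "large") with a 1-D k-means,
    # starting from the smallest and largest drop as initial centres
    init <- matrix(c(min(drops), max(drops)), ncol = 1)
    fit <- kmeans(drops, centers = init)

    # The "small" group is the one with the lower centre
    smallGroup <- which.min(fit$centers)

    # The index of the first "small" drop is the chosen number of clusters
    which(fit$cluster == smallGroup)[1]
}

# Made-up WSS values for k = 1..7 with two large drops followed by small ones
wss <- c(1000, 520, 100, 90, 82, 76, 72)
knee <- estimateKnee(wss)
print(knee)
```

With these made-up values the first small difference is the one between k = 3 and k = 4, so the function picks 3 clusters.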

Knee point estimation for KMeans clustering through R code
