Question

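I have clustered storm energy data in R using different clustering methods (kmeans, hclust, agnes, fanny). Even if it is easy to choose the best method for my work, I need a computational (not theoretical) way to compare and evaluate the methods through their results. Do you know of anything like that?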

Thanks in advance,

Answer 1

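Thanks for the question. From it I learned that you can compute the optimal number of clusters with the eclust function from the factoextra package.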

Using the kmeans demo from here:

# Load and scale the dataset
data("USArrests")
DF <- scale(USArrests)

When the data is not scaled, the clustering results might not be reliable ([example](http://stats.stackexchange.com/questions/140711/why-does-gap-statistic-for-k-means-suggest-one-cluster-even-though-there-are-ob)).
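To see why scaling matters for this dataset, the short sketch below compares the column standard deviations of USArrests before and after scale(). The names raw_sd and scaled_sd are illustrative only and are not part of the original demo.

```r
data("USArrests")

# Standard deviation of each variable on its original scale
raw_sd <- sapply(USArrests, sd)

# After scale(), every column is centered and has unit standard deviation
scaled_sd <- apply(scale(USArrests), 2, sd)

print(round(raw_sd, 2))
print(round(scaled_sd, 2))
```

On the raw data, Assault has by far the largest spread, so Euclidean distances (and therefore kmeans) would be dominated by that single variable.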

library("factoextra")

# Enhanced k-means clustering
res.km <- eclust(DF, "kmeans")


# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

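Comparison of clustering functions:

You can run all the available methods and compute the optimal number of clusters for each one as follows: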

clusterFuncList = c("kmeans", "pam", "clara", "fanny", "hclust", "agnes" ,"diana")


resultList <- do.call(rbind, lapply(clusterFuncList, function(x) {

  cat("Begin clustering for function:", x, "\n")

  # For each clustering function find the optimal number of clusters; graph = FALSE disables plotting
  clustObj <- eclust(DF, x, graph = FALSE)

  cat("End clustering for function:", x, "\n\n\n")

  # Return the optimal number of clusters for this clustering function as a one-row data frame
  data.frame(clustFunc = x, optimalNumbClusters = clustObj$nbclust, stringsAsFactors = FALSE)

}))

# >resultList
  # clustFunc optimalNumbClusters
# 1    kmeans                   4
# 2       pam                   4
# 3     clara                   5
# 4     fanny                   5
# 5    hclust                   4
# 6     agnes                   4
# 7     diana                   4

Gap statistic as a goodness-of-fit measure

The "gap statistic" is used as a measure of goodness of fit for a clustering algorithm; see paper
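For reference, the gap statistic is commonly defined as shown below. This is a sketch of the standard definition rather than something spelled out in this answer; the symbols W_k, D_r, n_r and C_r are introduced here for illustration.

```latex
\mathrm{Gap}_n(k) = E_n^{*}\!\left[\log W_k\right] - \log W_k,
\qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r,
\qquad
D_r = \sum_{i,\, i' \in C_r} d_{i i'}
```

Here C_r is cluster r with n_r observations, d_{ii'} is the pairwise dissimilarity between observations i and i', and E_n^* is the expectation under a reference (null) distribution, which is estimated by simulation. Because the gap is the amount by which log W_k falls below its reference expectation, larger values indicate more compact clusters than the null model would produce.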

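For a fixed, user-defined number of clusters, we can compare the gap statistic of each clustering algorithm using the clusGap function from the cluster package: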

numbClusters = 5

library(cluster)

clusterFuncFixedK = c("kmeans", "pam", "clara", "fanny")

gapStatList <- do.call(rbind,lapply(clusterFuncFixedK,function(x) {

cat("Begin clustering for function:",x,"\n")

set.seed(42)
#For each clustering function compute  gap statistic

gapStatBoot=clusGap(DF,FUNcluster=get(x),K.max=numbClusters)

gapStatVec= round(gapStatBoot$Tab[,"gap"],3)


gapStat_at_AllClusters = paste(gapStatVec,collapse=",")

gapStat_at_chosenCluster = gapStatVec[numbClusters]

#return gap statistic for each clustering function

cat("End clustering for function:",x,"\n\n\n")

resultDF = data.frame(clustFunc = x, gapStat_at_AllClusters = gapStat_at_AllClusters,gapStat_at_chosenCluster = gapStat_at_chosenCluster, stringsAsFactors=FALSE)

}))

# >gapStatList
#  clustFunc        gapStat_at_AllClusters gapStat_at_chosenCluster
#1    kmeans  0.184,0.235,0.264,0.233,0.27                    0.270
#2       pam 0.181,0.253,0.274,0.307,0.303                    0.303
#3     clara 0.181,0.253,0.276,0.311,0.315                    0.315
#4     fanny  0.181,0.23,0.313,0.351,0.478                    0.478

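The table above gives the gap statistic of each algorithm for k = 1 to 5. Column 3, gapStat_at_chosenCluster, holds the gap statistic at k = 5 clusters. A larger gap statistic indicates a better partition, since the within-cluster dispersion then falls further below what the reference distribution would give. So at k = 5 clusters, fanny has the highest value on the USArrests dataset and kmeans the lowest of the four algorithms compared.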
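The gap values in the table come from simulation, so each one also carries a standard error (the SE.sim column of the Tab matrix returned by clusGap). As a possible follow-up, the sketch below uses maxSE from the cluster package to pick k per method while taking that error into account. The wrapper functions, B = 50, and the names clusterFuns, summaryDF, bestK and gapAtBestK are illustrative choices, not part of the original answer.

```r
library(cluster)

data("USArrests")
DF <- scale(USArrests)

# Wrappers so every method returns list(cluster = <integer vector>), as clusGap expects
clusterFuns <- list(
  kmeans = function(x, k) list(cluster = kmeans(x, centers = k, nstart = 25)$cluster),
  pam    = function(x, k) list(cluster = pam(x, k, cluster.only = TRUE))
)

summaryDF <- do.call(rbind, lapply(names(clusterFuns), function(nm) {
  set.seed(42)
  # B = 50 reference data sets keeps the run short; a larger B gives steadier estimates
  gap <- clusGap(DF, FUNcluster = clusterFuns[[nm]], K.max = 5, B = 50, verbose = FALSE)
  tab <- gap$Tab  # columns: logW, E.logW, gap, SE.sim

  # Smallest k with gap(k) >= gap(k + 1) - SE(k + 1), the rule of Tibshirani et al. (2001)
  bestK <- maxSE(f = tab[, "gap"], SE.f = tab[, "SE.sim"], method = "Tibs2001SEmax")

  data.frame(clustFunc = nm, bestK = bestK, gapAtBestK = round(tab[bestK, "gap"], 3),
             stringsAsFactors = FALSE)
}))

print(summaryDF)
```

Comparing bestK and gapAtBestK across rows gives one computed, rather than theoretical, basis for ranking the methods; the same loop can be extended with further wrappers (for example clara or fanny) if needed.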
