Question

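I want to run hierarchical clustering with single linkage on a set of documents with 300 features and 1500 observations, and I want to find the optimal number of clusters for this problem.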

The link below uses the following code to find the number of clusters with the largest gap.

http://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning

library(cluster)     # clusGap()
library(factoextra)  # hcut(), fviz_gap_stat()

# Compute gap statistic
set.seed(123)

iris.scaled <- scale(iris[, -5])

gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)

# Plot gap statistic 
fviz_gap_stat(gap_stat)

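However, the linkage that hcut uses is not stated explicitly. How do I specify single-linkage hierarchical clustering for the clusGap() function?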

Is there an equivalent of clusGap() in Python?

Thanks.

Answer 1

The hcut() function is part of the factoextra package used in the link you posted:

hcut    package:factoextra    R Documentation

Computes Hierarchical Clustering and Cut the Tree

Description:

 Computes hierarchical clustering (hclust, agnes, diana) and cut
 the tree into k clusters. It also accepts correlation based
 distance measure methods such as "pearson", "spearman" and
 "kendall".

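R also has a built-in function, hclust(), that performs hierarchical clustering. By default, however, it uses complete linkage rather than single linkage, and it takes a dissimilarity object and returns a tree rather than cluster labels, so you cannot simply replace hcut with hclust.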
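To make the difference concrete, here is a minimal sketch using only base R. It assumes the iris data from the question as a stand-in for the real document matrix; the variable names are illustrative.

```r
# Base R only; iris ships with R and stands in for the real document matrix.
x <- scale(iris[, -5])
d <- dist(x)                       # hclust() expects a dissimilarity object, not raw data

hc_default <- hclust(d)                      # no method given
hc_single  <- hclust(d, method = "single")   # single linkage must be requested

hc_default$method                  # "complete"
hc_single$method                   # "single"

# cutree() turns the tree into one cluster label per observation for a chosen k
labels <- cutree(hc_single, k = 3)
table(labels)
```

The result of hclust() is a tree, so a separate cutree() step is always needed before anything that expects cluster labels.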

However, if you look at the help for clusGap(), you will see that you can supply a custom clustering function:

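FUNcluster: a 'function' which accepts as first argument a (data) matrix like 'x', second argument, say k, k >= 2, the number of clusters desired, and returns a 'list' with a component named (or shortened to) 'cluster' which is a vector of length 'n = nrow(x)' of integers in '1:k' determining the clustering or grouping of the 'n' observations.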
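The contract quoted above can be checked mechanically. The sketch below is only an illustration: check_funcluster() is a hypothetical helper written for this answer, not part of the cluster package, and iris again stands in for the real data.

```r
# check_funcluster() is a hypothetical helper, not part of any package.
# It tests whether f(x, k) returns a list whose `cluster` component holds
# one label in 1:k for every row of x.
check_funcluster <- function(f, x, k) {
  res <- f(x, k)
  if (!is.list(res) || is.null(res$cluster)) return(FALSE)
  cl <- res$cluster
  length(cl) == nrow(x) && all(cl %in% seq_len(k))
}

set.seed(123)
x <- scale(iris[, -5])

# kmeans(x, k) already returns a list with a `cluster` component
ok_kmeans <- check_funcluster(kmeans, x, 3)

# A function that returns the bare label vector does not satisfy the contract
bare_labels <- function(x, k) cutree(hclust(dist(x), method = "single"), k = k)
ok_bare <- check_funcluster(bare_labels, x, 3)

ok_kmeans   # TRUE
ok_bare     # FALSE
```

The second case shows why a hierarchical clustering has to be wrapped: the labels must come back inside a list under the name cluster.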

The hclust() function can perform single-linkage hierarchical clustering, so you can do this:

# Wrap hclust() to match the FUNcluster contract: take (x, k), return list(cluster = ...)
cluster_fun <- function(x, k) list(cluster=cutree(hclust(dist(x), method="single"), k=k))
gap_stat <- clusGap(iris.scaled, FUN=cluster_fun, K.max=10, B=50)
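To read the chosen number of clusters off the result instead of from the plot, something like the following sketch should work. It assumes the cluster package is installed and repeats the definitions above so that it runs on its own; the names k_largest_gap and k_first_se are illustrative.

```r
library(cluster)   # clusGap() and maxSE()

set.seed(123)
iris.scaled <- scale(iris[, -5])

# Single-linkage wrapper that matches the FUNcluster contract
cluster_fun <- function(x, k) {
  list(cluster = cutree(hclust(dist(x), method = "single"), k = k))
}

gap_stat <- clusGap(iris.scaled, FUNcluster = cluster_fun,
                    K.max = 10, B = 50, verbose = FALSE)

# gap_stat$Tab has one row per k, with columns logW, E.logW, gap and SE.sim
gap <- gap_stat$Tab[, "gap"]
se  <- gap_stat$Tab[, "SE.sim"]

# k with the largest gap, as the question asks
k_largest_gap <- which.max(gap)

# One of the standard-error-based rules offered by maxSE()
k_first_se <- maxSE(gap, se, method = "firstSEmax")

k_largest_gap
k_first_se
```

The two rules can disagree, so it is worth comparing both against the plot. As a possible alternative to the wrapper, hcut() has an hc_method argument and clusGap() forwards extra arguments to FUNcluster, so clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50, hc_method = "single") should give an equivalent clustering; that call is not tested here.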
