Question

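A hint I received on another question has left me confused.

I am working on an exercise that is actually part of a larger one:

1. Cluster some data using hclust (done).
2. Given a totally new vector, find out which of the clusters from step 1 it is nearest to.

According to the exercise, this should be doable in a fairly short time.

However, after a few weeks I am puzzled as to whether this can be done at all, because apparently all I really get from hclust is a tree, and not, as I had assumed, a number of clusters.

I suppose I was unclear: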

Answer 1

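You have to think about the right metric for defining closeness to a cluster. Building on the example in the hclust documentation, here is one way to compute the mean of each cluster and then measure the distance between the new data point and the set of means.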

# Leave out one state
A <-USArrests
B <-A[rownames(A)!="Kentucky",]
KY <- A[rownames(A)=="Kentucky",]

# Put the B data into 10 clusters
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
# Attach each state's cluster label, matched by row name
B$cluster <- memb[rownames(B)]

# Compute the averages over the clusters
M <-aggregate( .~cluster, data=B, FUN=mean)
M$cluster=NULL

# Now add the hold out state to the set of averages
M <-rbind(M,KY)

# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M),diag=TRUE,upper=TRUE))["Kentucky",]
names(D) = rownames(M)
KYclust  = which.min(D[-length(D)])
memb[memb==KYclust]

# Now cluster the full set of states and compare the results.  
hc   <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a <- memb["Kentucky"]
memb[memb==a]
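
The step that turns the tree into actual clusters in the code above is cutree. As a minimal sketch of the two ways to cut, by number of clusters (k) or by height (h), assuming the same USArrests data; the values k = 4 and h = 50 are arbitrary choices for illustration only:

```r
# hclust returns a dendrogram (a tree), not flat clusters
hc <- hclust(dist(USArrests), method = "average")

# Cut into a fixed number of clusters (k = 4 is an arbitrary choice)
memb_k <- cutree(hc, k = 4)

# Or cut at a fixed height on the dendrogram (h = 50 is an arbitrary choice)
memb_h <- cutree(hc, h = 50)

# Both results are named integer vectors: one cluster label per state
head(memb_k)
table(memb_k)
table(memb_h)
```

Either vector can then be used the same way memb is used above.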

Answer 2

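In contrast to k-means, the clusters found by hclust can be of arbitrary shape.

As a result, the distance to the nearest cluster center is not always meaningful.

A nearest-neighbor style assignment is probably a better choice.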
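
A nearest-neighbor assignment could look roughly like the following. This is only a sketch under assumptions: it reuses the USArrests hold-out setup from the first answer, uses plain Euclidean distance, and takes a single nearest neighbor (1-NN); the variable names d, nn and nn_cluster are made up for illustration.

```r
# Hold out one state to play the role of the new vector
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]

# Cluster the remaining states (linkage and k carried over from the first answer)
hc   <- hclust(dist(B), method = "average")
memb <- cutree(hc, k = 10)

# Euclidean distance from the new point to every clustered point:
# subtract the new vector from each row, square, sum, take the root
d <- sqrt(rowSums(sweep(as.matrix(B), 2, unlist(KY))^2))

# 1-nearest-neighbor assignment: adopt the cluster of the closest point
nn         <- names(which.min(d))
nn_cluster <- memb[nn]
nn_cluster
```

With more neighbors (say, a majority vote among the k closest points) the assignment becomes less sensitive to a single outlying point, at the cost of choosing one more parameter.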

Clustering: how to find the nearest cluster
