Question

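I want to use fuzzy C-means clustering on a large unsupervised data set of 41 variables and 415 observations. However, I am stuck on trying to validate those clusters. When I plot with an arbitrary number of clusters, I can explain a total of 54% of the variance, which is not great, and there are no clearly separated clusters like there would be with the iris data set.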

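First I ran fcm on my scaled data with 3 clusters just to see, but since I am trying to find a way to search for the best number of clusters, I do not want to set an arbitrarily defined number of clusters.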

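So I turned to Google and searched for "validate fuzzy clustering in R". This link here was good, but I still have to try a bunch of different numbers of clusters. I looked at the advclust, ppclust, and clValid packages, but I could not find a walkthrough for their functions. I also read the documentation of each package, but could not work out what to do next.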

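I walked through some possible numbers of clusters and checked each one with the k.crisp object from fanny. I started with 100 and went down to 4. Based on the object description in the documentation,

k.crisp = integer (≤ k) giving the number of crisp clusters; can be less than k, where it's recommended to decrease memb.exp.

this does not seem like a valid approach, because it compares the number of crisp clusters with the number of fuzzy clusters.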

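Is there a function with which I can check the validity of my clusters for 2:10 clusters? Also, is it worthwhile to check the validity of 1 cluster? I think that is a stupid question, but I have a weird feeling that 1 optimal cluster might be what I get. (Any tips on what to do if I were to get 1 cluster, besides cry a little on the inside?)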

Code

library(cluster)
library(factoextra)
library(ppclust)
library(advclust)
library(clValid)
data(iris)
df<-sapply(iris[-5],scale)
res.fanny<-fanny(df,3,metric='SqEuclidean')
res.fanny$k.crisp
# When I try to use euclidean, I get the warning all memberships are very close to 1/l. Maybe increase memb.exp, which I don't fully understand
# From my understanding using the SqEuclidean is equivalent to Fuzzy C-means, use the website below. Ultimately I do want to use C-means, hence I use the SqEuclidean distance
fviz_cluster(res.fanny,ellipse.type='norm',palette='jco',ggtheme=theme_minimal(),legend='right')
fviz_silhouette(res.fanny,palette='jco',ggtheme=theme_minimal())

# With ppclust
set.seed(123)
res.fcm<-fcm(df,centers=3,nstart=10)

Website as mentioned above.

Answer 1

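As far as I know, you need to go through different numbers of clusters and see how the percentage of variance explained changes with the number of clusters. This is called the elbow method.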

wss <- sapply(2:10, 
       function(k){fcm(df,centers=k,nstart=10)$sumsqrs$tot.within.ss})

plot(2:10, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

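The resulting plot is:

[Figure: total within-clusters sum of squares plotted against the number of clusters K]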

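After k = 5, the total within-cluster sum of squares tends to change slowly. So, according to the elbow method, k = 5 is a good candidate for the optimal number of clusters.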
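
The elbow plot uses only the within-cluster sum of squares. As a complementary check that looks at the fuzzy memberships themselves, here is a minimal sketch (an illustration, not part of the packages' documented workflow) that loops over k = 2:10 with fanny from the cluster package and computes the partition coefficient, i.e. the sum of squared memberships divided by the number of observations. The helper partition_coef and the column names pc and pc_norm are names made up for this example, and metric = "SqEuclidean" is taken from the question's code. Values near 1/k suggest an almost uniform (very fuzzy) partition, values near 1 an almost crisp one.

```r
library(cluster)  # fanny() comes with the 'cluster' package

df <- scale(iris[-5])  # scaled iris features, as in the question
ks <- 2:10

# Hypothetical helper: partition coefficient of a membership matrix u
# (rows = observations, columns = clusters). Lies between 1/k and 1.
partition_coef <- function(u) sum(u^2) / nrow(u)

validity <- t(sapply(ks, function(k) {
  # Convergence warnings for larger k are silenced in this sketch
  fit <- suppressWarnings(fanny(df, k, metric = "SqEuclidean"))
  pc  <- partition_coef(fit$membership)
  c(k       = k,
    k.crisp = fit$k.crisp,                    # number of crisp clusters found
    pc      = pc,
    pc_norm = (pc - 1 / k) / (1 - 1 / k))     # rescaled to the range [0, 1]
}))

print(round(validity, 3))
```

Rows where k.crisp is smaller than k, or where pc_norm is close to 0, are candidates to discard. The raw coefficient tends to shrink as k grows, so the normalized column is usually the more useful one to compare across k, and it is best read alongside the elbow plot rather than as a verdict on its own.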
