This is both a theoretical and a practical question. I have a data frame, df, containing the x, y, z coordinates of a set of points. These points lie on a 3D surface generated by image segmentation. I want to determine whether the points are randomly distributed on this surface or whether they exhibit some clustering. I'm testing this in R.
The first method I am using is k-means. I ask the computer to determine the best number of groups (if any) that these data can be made to fit into, using the code below. It tests 30 different indices (various methods) and outputs the best number of clusters:
library("NbClust")
nb <- NbClust(df, distance = "euclidean", min.nc = 2,
max.nc = 10, method = "kmeans")
library("factoextra")
fviz_nbclust(nb)
This code comes from http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/
I get a certain number of clusters, which I guess is itself an indication of clustering. However, I would like to calculate a metric from this result. Any suggestions on how to do that?
In addition, I am checking for clustering via histograms of the pairwise distances:
df_mat <- as.matrix(df)  # base R; avoids needing magrittr/dplyr for %>%
dist_df <- dist(df_mat)
hist(dist_df)
I would expect multiple peaks if the points are clustered, and perhaps a single peak if they are more or less randomly distributed.
Another approach I am trying is hierarchical clustering:
my_hclustdf <- hclust(dist_df)
plot(my_hclustdf)
However, the output, a dendrogram, does not tell me much by itself.
Any suggestions would be greatly appreciated. Many thanks.
Answer 0 (score: 0)
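"randomly distributed on this surface or if they exhibit some clustering"
The problem is that this is too vague. What is 'randomly distributed', and what is 'some clustering'?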
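There are tools for testing this kind of thing. For example, the Hopkins statistic can be used to test whether a distribution is uniformly random. But the absence of a uniform random distribution does not mean that clusters exist; the data is merely not uniform. A similar problem applies to k-means: just because some heuristic tells you to use k = 3 does not prove that there are three clusters. A heuristic may suggest this even on uniform random data. If you tell k-means to find k clusters, it will find k "clusters", even in uniform random data.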
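Both points can be illustrated with a minimal base R sketch. The function name hopkins_stat, the sample size m, and the choice of the data's bounding box as the uniform reference are assumptions made for illustration, not part of any package. Conventions for the Hopkins statistic differ between implementations; here values near 0.5 suggest uniform randomness and values near 1 suggest non-uniform data. Note that for points lying on a surface, the bounding box is probably the wrong reference, so a high value would not by itself indicate clusters.

```r
# Hopkins statistic, base R only (illustrative sketch, not a package API).
# Convention used here: ~0.5 for uniform random data, towards 1 otherwise.
hopkins_stat <- function(x, m = ceiling(nrow(x) / 10)) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)

  # m real points, sampled without replacement
  idx <- sample(n, m)

  # m artificial points, uniform in the bounding box of the data
  # (assumption: the bounding box is a sensible reference region)
  lo <- apply(x, 2, min)
  hi <- apply(x, 2, max)
  fake <- matrix(runif(m * d, rep(lo, each = m), rep(hi, each = m)), nrow = m)

  # distance from point p to its nearest neighbour among the rows of pts
  nn_dist <- function(p, pts) sqrt(min(rowSums(sweep(pts, 2, p)^2)))

  u <- apply(fake, 1, nn_dist, pts = x)                         # artificial -> data
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))  # data -> data

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(1)
unif <- matrix(runif(300 * 3), ncol = 3)  # uniform random points in a cube

h_unif <- hopkins_stat(unif, m = 30)      # expected to be roughly 0.5

# k-means still returns three groups, although no clusters exist
km <- kmeans(unif, centers = 3, nstart = 10)
table(km$cluster)
```

The k-means call partitions the cube into three regions of similar size, which is exactly what it would do for any data, so the existence of a partition is not evidence of clustering.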
What you probably want is to find multiple, separate modes of density.
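One possible way to look for separate density modes is mean shift, sketched below in base R. This is a swapped-in technique chosen for illustration, not something the answer names; density-based methods such as DBSCAN are another option. The function name mean_shift, the Gaussian kernel, the fixed iteration count, and the bandwidth value are all assumptions, and the result depends strongly on the bandwidth.

```r
# Mean shift with a Gaussian kernel, base R only (illustrative sketch).
# Every point climbs towards a local maximum of the kernel density estimate;
# points that end up at the same maximum share a density mode.
mean_shift <- function(x, bandwidth, iters = 50) {
  x <- as.matrix(x)
  pts <- x
  for (it in seq_len(iters)) {
    for (i in seq_len(nrow(pts))) {
      d2 <- rowSums(sweep(x, 2, pts[i, ])^2)   # squared distances to the data
      w <- exp(-d2 / (2 * bandwidth^2))        # Gaussian kernel weights
      pts[i, ] <- colSums(x * w) / sum(w)      # move to the weighted mean
    }
  }
  # group shifted points that ended up within one bandwidth of each other
  labels <- cutree(hclust(dist(pts), method = "single"), h = bandwidth)
  list(modes = pts, cluster = labels)
}

# Synthetic example: two well-separated blobs in 3D
set.seed(1)
blobs <- rbind(matrix(rnorm(100 * 3, mean = 0, sd = 0.3), ncol = 3),
               matrix(rnorm(100 * 3, mean = 5, sd = 0.3), ncol = 3))

ms <- mean_shift(blobs, bandwidth = 1)
table(ms$cluster)  # number of points assigned to each density mode
```

If only one mode is found across a reasonable range of bandwidths, that is a hint that the data has no separate dense regions, whatever k-means reports.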