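I'm new to data science and have just set out on this exciting journey. Along the way, I have been studying cluster analysis as part of EDA. While learning it, I kept reading statements like this:
Clusters are in the eye of the beholder
But while playing with some data from Kaggle, I decided to run a for loop to compare the number of clusters against the total within-cluster sum of squares (tot.withinss), as follows: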
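Attempt {1}
ss1 <- list()
set.seed(2)
# Fit k-means for k = 2..100 and record the total within-cluster sum of squares
for (i in 2:100) {
  cluster <- kmeans(bone[, 2:3], centers = i)
  ss1[[i]] <- cluster$tot.withinss
}
# Keeps the assignments from the last fit only (k = 100)
bone$clusters <- cluster$cluster
# Plot number of clusters against tot.withinss
# ss1[[1]] is never set, so unlist() returns 99 values for k = 2..100
ss1 <- unlist(ss1)
num_clust <- data.frame(x = 2:100, y = ss1)
plot(num_clust, xlab = 'Number of Clusters', ylab = 'Total within sum squared error',
     main = 'Number of Clusters Vs Tot.withinss')
abline(v = 4.5, lty = 'dashed')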
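Attempt {2}
ss2 <- list()
set.seed(2)
# Same loop, but each k-means fit keeps the best of 30 random starts
for (i in 2:100) {
  cluster <- kmeans(bone[, 2:3], centers = i, nstart = 30)
  ss2[[i]] <- cluster$tot.withinss
}
# Keeps the assignments from the last fit only (k = 100)
bone$clusters <- cluster$cluster
# Plot number of clusters against tot.withinss
# ss2[[1]] is never set, so unlist() returns 99 values for k = 2..100
ss2 <- unlist(ss2)
num_clust <- data.frame(x = 2:100, y = ss2)
plot(num_clust, xlab = 'Number of Clusters', ylab = 'Total within sum squared error',
     main = 'Number of Clusters Vs Tot.withinss')
abline(v = 4.5, lty = 'dashed')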
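Attempt {3}
ss3 <- list()
set.seed(2)
# Same loop, but each k-means fit keeps the best of 100 random starts
for (i in 2:100) {
  cluster <- kmeans(bone[, 2:3], centers = i, nstart = 100)
  ss3[[i]] <- cluster$tot.withinss
}
# Keeps the assignments from the last fit only (k = 100)
bone$clusters <- cluster$cluster
# Plot number of clusters against tot.withinss
# ss3[[1]] is never set, so unlist() returns 99 values for k = 2..100
ss3 <- unlist(ss3)
num_clust <- data.frame(x = 2:100, y = ss3)
plot(num_clust, xlab = 'Number of Clusters', ylab = 'Total within sum squared error',
     main = 'Number of Clusters Vs Tot.withinss')
abline(v = 4.5, lty = 'dashed')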
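Attempt {4}
ss4 <- list()
set.seed(2)
# Same loop, but each k-means fit keeps the best of 200 random starts
for (i in 2:100) {
  cluster <- kmeans(bone[, 2:3], centers = i, nstart = 200)
  ss4[[i]] <- cluster$tot.withinss
}
# Keeps the assignments from the last fit only (k = 100)
bone$clusters <- cluster$cluster
# Plot number of clusters against tot.withinss
# ss4[[1]] is never set, so unlist() returns 99 values for k = 2..100
ss4 <- unlist(ss4)
num_clust <- data.frame(x = 2:100, y = ss4)
plot(num_clust, xlab = 'Number of Clusters', ylab = 'Total within sum squared error',
     main = 'Number of Clusters Vs Tot.withinss')
abline(v = 4.5, lty = 'dashed')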
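As you can see:
1- There are 4 clusters whose tot.withinss values differ greatly
2- These 4 clusters are stable regardless of the number of random starting centroids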
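The four attempts differ only in `nstart`, so the comparison behind observation 2 could be collapsed into one loop and checked numerically rather than by eye. A minimal sketch, assuming synthetic stand-in data (`toy`) in place of `bone[,2:3]`, a shorter range of k to keep it quick, and a hypothetical helper `wss_curve` that is not part of any package:

```r
# Synthetic stand-in for bone[, 2:3] (assumption: NOT the Kaggle data):
# four 2-D Gaussian blobs of 50 points each
set.seed(2)
blob <- function(cx, cy) cbind(rnorm(50, cx, 0.3), rnorm(50, cy, 0.3))
toy <- rbind(blob(0, 0), blob(3, 3), blob(0, 3), blob(3, 0))

# Hypothetical helper: tot.withinss for every k in ks at a given nstart
wss_curve <- function(data, ks, nstart = 1) {
  vapply(ks, function(k) {
    kmeans(data, centers = k, nstart = nstart)$tot.withinss
  }, numeric(1))
}

ks <- 2:10
nstarts <- c(1, 30, 100, 200)

# One column per nstart value, one row per k
curves <- sapply(nstarts, function(n) {
  set.seed(2)  # same seed before each run, as in the attempts above
  wss_curve(toy, ks, nstart = n)
})
dimnames(curves) <- list(k = ks, nstart = nstarts)

# Spread of tot.withinss across the nstart settings for each k;
# small values mean the curve barely depends on nstart at that k
spread <- apply(curves, 1, function(r) diff(range(r)))
print(round(cbind(curves, spread = spread), 2))
```

If a plot is preferred, something like `matplot(ks, curves, type = "b")` should overlay the four curves on one set of axes. Note that this only compares tot.withinss values; it does not check whether the cluster assignments themselves agree between runs.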
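Conclusion:
Can I use this method to determine the number of clusters from the plot, based on the differences in the total within-cluster sum of squares, instead of choosing K at random?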
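To make "the differences in the total within-cluster sum of squares" concrete, the drop gained by each extra cluster can be tabulated instead of read off the plot. This is only a rough sketch of the elbow heuristic, again assuming synthetic stand-in data (`toy`) rather than the `bone` data; the column names in `elbow` are made up for illustration:

```r
# Synthetic stand-in for bone[, 2:3] (assumption: NOT the Kaggle data)
set.seed(2)
blob <- function(cx, cy) cbind(rnorm(50, cx, 0.3), rnorm(50, cy, 0.3))
toy <- rbind(blob(0, 0), blob(3, 3), blob(0, 3), blob(3, 0))

ks <- 2:10
wss <- vapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 30)$tot.withinss
}, numeric(1))

# Reduction in tot.withinss when going from k to k + 1,
# in absolute terms and as a fraction of the value at k
drop <- -diff(wss)
rel_drop <- drop / wss[-length(wss)]

elbow <- data.frame(
  k_from   = ks[-length(ks)],
  k_to     = ks[-1],
  drop     = round(drop, 2),
  rel_drop = round(rel_drop, 3)
)
print(elbow)
```

A large drop followed by consistently small ones is what the elbow in the plot corresponds to. Where exactly the drops count as "small" is still a judgment call, so the table supports the visual reading rather than replacing it.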