Question

I have been running a cluster analysis on a relatively large dataset (about 50,000 observations and 16 variables).

library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data

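Because my computer cannot handle the full dataset, I drew several subsets (10 subsets of 5,000 observations, plus the 16,000-observation sample in the example below, which takes about 15 minutes to compute), and I am using Mclust to determine the optimal number of groups.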

ind <- sample(1:nrow(mdper.f), size = 16000) # sample 16,000 rows (about 15 min of computing)
nfac <- mdper.f[ind, ] # subsample
Fnac <- scale(nfac) # scale the subsample
mod = Mclust(Fnac) # determine the optimal number of clusters
summary(mod) # summary

#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust VII (spherical, varying volume) model with 9 components:

log.likelihood     n df    BIC      ICL
   128118.2 16000 80 255462 254905.3

Clustering table:
   1    2    3    4    5    6    7    8    9 
1879 2505 3452 3117 2846  464  822  590  325

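The result is always 9 (in 10 out of 10 of the 5,000-observation subsets), so I assume that is fine. Now I want to assign the estimated cluster partition to the rest of the data, so that the full multidimensional dataset is clustered.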

How should I do this?

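I started working with the Mclust object, but I can't see how to handle it and apply it to the rest of the data. The ideal solution would be, for example, my original data with an extra column containing the assigned cluster number (1 to 9).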

Answer 1

I found the answer after a few minutes of work:

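First, there was a conceptual error: the dataset must be scaled before it is split into subsets, and only then is predict() applied to the scaled data.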

library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data

mdper.f.s <- scale(mdper.f)#Scaling data 
ind<- sample(1:nrow(mdper.f.s),size=16000)#sampling with 16.000 
nfac <- mdper.f.s[ind,]#sampling
mod16 = Mclust(nfac) # Determining the optimal number of clusters, 15 min computing with 7 vars

prediction<-predict(mod16 ,mdper.f.s )#Predict with calculated model and scaled data
mdper.f.pred <- cbind(mdper.f, prediction$classification) # Append the classification to the original data
colnames(mdper.f.pred)[ncol(mdper.f.pred)] <- "Cluster" # Assign a name to the new (last) column
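
For anyone who wants to try the workflow without the original file, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the matrix `full` stands in for mdper.f, the sizes are tiny, the seed is arbitrary, and `G = 1:4` is restricted only so the demo runs quickly. It follows the same steps as above: scale the full data once, fit Mclust on a subsample, then use predict() to classify every row.

```r
library(mclust)

set.seed(1)  # arbitrary seed for reproducible sampling (not in the original code)

# Synthetic stand-in for mdper.f: 3000 rows, 2 variables, two separated groups
full <- rbind(
  matrix(rnorm(3000, mean = 0), ncol = 2),
  matrix(rnorm(3000, mean = 8), ncol = 2)
)
colnames(full) <- c("v1", "v2")

full.s <- scale(full)                       # scale the full data once
ind    <- sample(nrow(full.s), size = 500)  # rows used to fit the model
fit    <- Mclust(full.s[ind, ], G = 1:4)    # G limited only to keep the demo fast

pred   <- predict(fit, newdata = full.s)    # classify all rows, sampled or not
result <- data.frame(full, Cluster = pred$classification)

table(result$Cluster)                       # cluster sizes over the full data
```

If further observations arrive later, they would need to be scaled with the same centers and standard deviations as the fitted data; scale() stores these in the "scaled:center" and "scaled:scale" attributes of its result.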

Cluster assignment after estimation on a large dataset (Mclust)
