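Mclust: the order of the input variables affects the clustering result

Time: 2013-12-05 05:50:48

Tags: r cluster-analysis

I am using mclust to look at the clusters in a dataset, using different numbers of inputs (X, Y, Z, R and S in the script below):

e.g.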

elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)

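I have just discovered that the order of the input variables matters and affects the result; in other words, elements <- cbind(X,Y,Z,R,S) gives different clusters than elements <- cbind(Y,Z,X,R,S). My understanding was that all input variables carry the same weight and importance in the cluster analysis. Am I wrong, or is this a bug?

I see this in R 2.15.3 and in two other versions of R.

Any comments or explanations on the above are appreciated.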
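When checking whether two column orders really give "different clusters", it helps to remember that cluster labels are arbitrary: the same partition can come back with the labels permuted. The following is a minimal base R sketch of a label-invariant comparison; the helper name same_partition and the example label vectors are hypothetical and only serve as illustration.

```r
# Hypothetical helper: TRUE when two label vectors describe the same
# partition of the observations, regardless of how clusters are numbered.
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  # Identical partitions give a cross-table with exactly one non-zero
  # cell in every row and in every column.
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

labels_a <- c(1, 1, 2, 2, 3)
labels_b <- c(3, 3, 1, 1, 2)  # same grouping, labels permuted
labels_c <- c(1, 2, 2, 2, 3)  # genuinely different grouping

res_ab <- same_partition(labels_a, labels_b)
res_ac <- same_partition(labels_a, labels_c)
```

For fitted models, the same check could be applied to the classification vectors of the two Mclust results.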

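4 Answers:

Answer 0: (score: 2)

Unfortunately I cannot comment on or edit my earlier comment, so I am posting an answer instead. @m-dz put me on a path that I think has revealed a likely answer. Specifically: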

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /   
 / /  / / /___/ /___/ /_/ /___/ // /    
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.

> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")

> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mclust_5.2.2

loaded via a namespace (and not attached):
[1] tools_3.3.2

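As you can see, this produces two solutions that match @m-dz's! What I had done earlier, however, was to load the psych package first. I now see that it masks sim from mclust. My guess is that this leads to the errant solutions: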

> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:mclust’:

    sim

> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.

> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:

 log.likelihood    n df      BIC      ICL
       3547.068 3195 49 6698.738 6692.126

Clustering table:
   1    2 
2759  436 

> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:

 log.likelihood    n  df      BIC      ICL
       18473.94 3195 137 35842.37 35834.51

Clustering table:
  1   2   3   4   5   6 
431 932 210 881 524 217 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.9  mclust_5.2.2

loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2    foreign_0.8-67 mnormt_1.5-5  

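Answer 1: (score: 1)

In general, Gaussian mixture model clustering is initialized randomly, because it can only find a local maximum.

Do not expect it to return the same result every time.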
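Whether a given implementation is actually seed-dependent can be tested directly. As far as I know, Mclust by default starts EM from a model-based hierarchical agglomeration rather than from random starts, so on small data the seed may have no effect at all. A minimal sketch, assuming mclust is installed and using the built-in iris data purely as a stand-in:

```r
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)

  X <- iris[, 1:4]  # stand-in data, not the data from the question

  # Fit the same model twice under different seeds.
  set.seed(1)
  fit1 <- Mclust(X, G = 1:3)
  set.seed(2)
  fit2 <- Mclust(X, G = 1:3)

  # TRUE means the seed did not change the assignment of observations.
  same_fit <- identical(fit1$classification, fit2$classification)
  print(same_fit)
}
```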

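Answer 2: (score: 1)

Edit:

Scratch my previous edit. read.file treating the first row as a header looked like the culprit, but that is not the whole story. Apparently columns 1 to 6, whether they are called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11, give different results depending on their order. I will investigate further later.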

library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United Kingdom.1252 
# [2] LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
#   [1] magrittr_1.5 psych_1.7.5  mclust_5.3  
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.0    parallel_3.4.0    tools_3.4.0      
# [4] foreign_0.8-68    rstudioapi_0.6    mdaddins_0.0.0001
# [7] nlme_3.1-131      mnormt_1.5-5      grid_3.4.0       
# [10] lattice_0.20-35  

testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE)  # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>% set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")

testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()

# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))

# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))

# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
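The X1, X3, X5, ... names in the script above are what R produces when the first data row is consumed as a header. The sketch below reproduces that effect with base read.table on a small in-memory table; the three-row text is made up, and read.table is used here as a stand-in for psych's read.file.

```r
# Made-up data whose first row happens to be "1 3 5".
txt <- "1 3 5\n2 4 6\n1 4 6\n"

no_header   <- read.table(text = txt, header = FALSE)
with_header <- read.table(text = txt, header = TRUE)

# Without a header every row is data and columns get default names.
print(names(no_header))    # "V1" "V2" "V3"
print(nrow(no_header))     # 3

# With a header the first row is lost as data, and its values become
# syntactically valid column names by prefixing "X".
print(names(with_header))  # "X1" "X3" "X5"
print(nrow(with_header))   # 2
```

Under this reading, X1, X3, X5, X7, X9, X11 refer to columns 1 to 6 of the file, whereas V1, V3, V5, V7, V9, V11 refer to columns 1, 3, 5, 7, 9 and 11, so the two selections are not the same variables.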

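Old answer:

I did my best to investigate this:

  • tested with current R (3.4.0) and mclust (5.3): order and seed have no effect;
  • mclust 4.2 (the current version in Dec '13, when the question was asked): again, no effect;
  • R 2.15.3, as mentioned by @user3068797: could not compile mclust 4.2; gave up, as debugging it would take too long;
  • @Cody did not provide sessionInfo(), so I do not know where to dig further.

Code: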

library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# other attached packages:
# [1] mclust_5.3

testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")

## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 



## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)

## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol

## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
#   optimal number of clusters occurs at max choice

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
# log.likelihood    n df      BIC       ICL
#      -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
#   optimal number of clusters occurs at max choice



## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)

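Edit:

There is no difference between the Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3) (uncholf 5.3, unchol 4.3).

Mclust itself does differ, but gives the same results, so I guess the changes are just bug fixes and the like (Mclust 5.3, Mclust 4.3).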

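Answer 3: (score: 0)

I realize this is a very old thread, but I think it is still worth posting the (official) answer. The issue is discussed, together with a proposed solution, in "mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models" from The R Journal (https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf), pp. 305-307. In short: "A problem with the MBHAC approach may arise in the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded when measured. In this case, ties must be broken by choosing the pair of entities that will be merged. This is often done at random, but regardless of which method is adopted for breaking ties, this choice can have important consequences because it changes the clustering of the remaining observations. Moreover, the final EM solution may depend on the ordering of the variables."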
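A way to try this out is to run the hierarchical initialization on transformed data and compare the fits across two column orders. The sketch below assumes a mclust version in which hc() accepts a use argument with the values "VARS" and "SVD"; the iris data, the permutation perm and the helper fit_order_pair are stand-ins chosen for illustration, and no particular outcome is claimed.

```r
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)

  X <- as.matrix(iris[, 1:4])  # recorded to one decimal, so ties occur
  perm <- c(4, 3, 1, 2)        # hypothetical reordering of the columns

  # Hypothetical helper: fit both column orders with the same
  # initialization scheme and return their adjusted Rand index.
  fit_order_pair <- function(use) {
    a <- Mclust(X, G = 1:4,
                initialization = list(hcPairs = hc(X, use = use)))
    b <- Mclust(X[, perm], G = 1:4,
                initialization = list(hcPairs = hc(X[, perm], use = use)))
    adjustedRandIndex(a$classification, b$classification)
  }

  ari_vars <- fit_order_pair("VARS")  # agglomeration on the raw variables
  ari_svd  <- fit_order_pair("SVD")   # agglomeration on SVD-transformed data

  print(c(VARS = ari_vars, SVD = ari_svd))
}
```

An adjusted Rand index of 1 means the two column orders gave the same partition; values below 1 indicate that the ordering changed the result for that initialization.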