I am using mclust with varying numbers of inputs (X, Y, Z, R and S in the script below) to look at the various clusters in a dataset, e.g.:
elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)
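I just discovered that the order of the input parameters matters and affects the results; in other words, elements <- cbind(X,Y,Z,R,S) gives different clusters than elements <- cbind(Y,Z,X,R,S).
My understanding is that all input parameters have the same weight and importance in a cluster analysis. Am I wrong, or is this a bug?
I have seen this in R 2.15.3 and in two other R versions.
Any comments or explanations on the above are appreciated.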
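Answer 0 (score: 2)
Unfortunately I cannot comment on or edit my previous comment, so I am posting an answer. @m-dz set me on a path that I think has revealed a possible answer. Specifically: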
> library(mclust)
__ ___________ __ _____________
/ |/ / ____/ / / / / / ___/_ __/
/ /|_/ / / / / / / / /\__ \ / /
/ / / / /___/ /___/ /_/ /___/ // /
/_/ /_/\____/_____/\____//____//_/ version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.
> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")
> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EII (spherical, equal volume) model with 9 components:
log.likelihood n df BIC ICL
-3597.466 3196 63 -7703.32 -7735.137
Clustering table:
1 2 3 4 5 6 7 8 9
774 150 752 486 227 224 238 178 167
> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EII (spherical, equal volume) model with 9 components:
log.likelihood n df BIC ICL
-3597.466 3196 63 -7703.32 -7735.137
Clustering table:
1 2 3 4 5 6 7 8 9
774 150 752 486 227 224 238 178 167
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_5.2.2
loaded via a namespace (and not attached):
[1] tools_3.3.2
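As you can see, this produces two solutions that match @m-dz's! However, what I did earlier was load the psych package first. I now see that psych masks sim from mclust. My guess is that this leads to the erroneous solutions: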
> library(psych)
Attaching package: ‘psych’
The following object is masked from ‘package:mclust’:
sim
> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.
> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:
log.likelihood n df BIC ICL
3547.068 3195 49 6698.738 6692.126
Clustering table:
1 2
2759 436
> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:
log.likelihood n df BIC ICL
18473.94 3195 137 35842.37 35834.51
Clustering table:
1 2 3 4 5 6
431 932 210 881 524 217
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] psych_1.6.9 mclust_5.2.2
loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2 foreign_0.8-67 mnormt_1.5-5
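Whether the masked sim is really responsible can be checked in isolation. A masked name only changes what is found from the top-level search path; code inside a package normally resolves its helpers through the package's own namespace. The base R sketch below illustrates this with stats::sd, which calls var internally. The masking function is a made-up stand-in, and it is an assumption (not verified here) that Mclust behaves the same way with respect to sim.

```r
# Hypothetical stand-in for a masking package: a top-level `var` that always fails.
var <- function(...) stop("masked version of var() was called")

# stats::sd() calls var() internally, but looks it up in the stats namespace,
# so the masking definition above is never reached.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
sd_value <- sd(x)

# Calling var() from the top level does hit the masking definition.
masked_hit <- inherits(try(var(x), silent = TRUE), "try-error")

# The package's own version stays reachable with an explicit namespace prefix.
var_value <- stats::var(x)
```

Note also that the two pairs of summaries above report different sample sizes (n = 3196 versus n = 3195), so it is worth ruling out a difference in the data that was read before blaming the masking.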
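Answer 1 (score: 1)
Typically, Gaussian mixture model clustering is initialized randomly, because it can only find a local maximum.
Do not expect it to return the same result every time.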
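As a minimal sketch of what random initialization means in practice, the base R code below fits a one-dimensional, two-component Gaussian mixture by EM, starting from randomly chosen data points. The function em_gmm_1d and the simulated data are illustrative assumptions, not part of mclust, and mclust's own default start may differ from this scheme. The point is only that a randomly initialized fit becomes repeatable once the seed is fixed.

```r
# Hypothetical helper: EM for a 1-D Gaussian mixture with a random start.
em_gmm_1d <- function(x, k = 2, iters = 50) {
  mu <- sample(x, k)            # random initialization: k data points as means
  sigma <- rep(sd(x), k)        # common starting spread
  prop <- rep(1 / k, k)         # equal starting mixing proportions
  weighted_density <- function() {
    sapply(seq_len(k), function(j) prop[j] * dnorm(x, mu[j], sigma[j]))
  }
  for (i in seq_len(iters)) {
    # E-step: responsibilities, one column per component
    dens <- weighted_density()
    resp <- dens / rowSums(dens)
    # M-step: update means, spreads and mixing proportions
    nk <- colSums(resp)
    mu <- colSums(resp * x) / nk
    sigma <- pmax(sqrt(colSums(resp * outer(x, mu, "-")^2) / nk), 1e-6)
    prop <- nk / length(x)
  }
  list(mu = mu, sigma = sigma, prop = prop,
       loglik = sum(log(rowSums(weighted_density()))))
}

# Simulated data: two well-separated groups (made-up values).
set.seed(42)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 6))

# Same seed, same random start, same fit.
set.seed(1); fit_a <- em_gmm_1d(x)
set.seed(1); fit_b <- em_gmm_1d(x)
same_fit <- identical(fit_a, fit_b)
```

Without the set.seed calls, the starting means differ between runs, and on harder data the fits can end in different local maxima.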
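Answer 2 (score: 1)
This is a complete redo of my previous edit. I thought the culprit was read.file treating the first row as a header, but that is not the case. Apparently it is columns 1 to 6, whether they are called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11, that give different results. I will investigate further later.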
library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# Matrix products: default
#
# locale:
# [1] LC_COLLATE=English_United Kingdom.1252
# [2] LC_CTYPE=English_United Kingdom.1252
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C
# [5] LC_TIME=English_United Kingdom.1252
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods
# [7] base
#
# other attached packages:
# [1] magrittr_1.5 psych_1.7.5 mclust_5.3
#
# loaded via a namespace (and not attached):
# [1] compiler_3.4.0 parallel_3.4.0 tools_3.4.0
# [4] foreign_0.8-68 rstudioapi_0.6 mdaddins_0.0.0001
# [7] nlme_3.1-131 mnormt_1.5-5 grid_3.4.0
# [10] lattice_0.20-35
testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE) # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>% set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()
# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))
# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))
# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))
# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))
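The pattern in the last two comparisons is consistent with how column names are generated when the first row is consumed as a header. The sketch below uses base read.table on a small inline sample (made-up values, not the chess data); it is an assumption that read.file handles this file in a comparable way. With a header row of 1 3 5 7, the names X1, X3, X5, X7 refer to columns 1 to 4, not to columns 1, 3, 5 and 7.

```r
# Made-up sample whose first row looks like data but can be read as a header.
sample_lines <- "1 3 5 7
1 3 5 8
2 4 6 7
1 4 5 8"

no_header   <- read.table(text = sample_lines, header = FALSE)
with_header <- read.table(text = sample_lines, header = TRUE)

# Default names count the columns: V1, V2, V3, V4.
names_no_header <- names(no_header)

# Header names come from the first row, prefixed with "X": X1, X3, X5, X7.
names_with_header <- names(with_header)

# The header read has one row fewer, because the first row became the names.
rows_dropped <- nrow(no_header) - nrow(with_header)

# X3 is the *second* column, i.e. V2 without its first entry, not V3.
same_column <- identical(with_header$X3, no_header$V2[-1])
```

So selecting X1, X3, X5, X7, X9, X11 from a header read and V1, V3, V5, V7, V9, V11 from a headerless read picks different columns, which would explain why the two sets of results are not comparable.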
I have done my best to investigate this issue:
Code:
library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
#
# other attached packages:
# [1] mclust_5.3
testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")
## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
## Question asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)
## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
if(is.null(upper)) {
upper <- any(x[row(x) < col(x)])
lower <- any(x[row(x) > col(x)])
if(upper && lower)
stop("not a triangular matrix")
if(!(upper || lower)) {
x <- diag(x)
return(diag(x * x))
}
}
dimx <- dim(x)
storage.mode(x) <- "double"
.Fortran("uncholf",
as.logical(upper),
x,
as.integer(nrow(x)),
as.integer(ncol(x)),
integer(1),
PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol
## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
# optimal number of clusters occurs at max choice
set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm
# ----------------------------------------------------
#
# Mclust EII (spherical, equal volume) model with 9 components:
#
# log.likelihood n df BIC ICL
# -3597.466 3196 63 -7703.32 -7735.137
#
# Clustering table:
# 1 2 3 4 5 6 7 8 9
# 774 150 752 486 227 224 238 178 167
#
# Warning messages:
# 1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
# best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
# optimal number of clusters occurs at max choice
## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Tried with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)
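There is no difference between the Fortran routines unchol (mclust 4.2) and uncholf (mclust 5.3): uncholf 5.3, unchol 4.3
Mclust itself does differ between versions, but it gives the same results, so I guess the changes are only bug fixes and the like: Mclust 5.3, Mclust 4.3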
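Answer 3 (score: 0)
I realize this is a very old thread, but I think it is still worth posting the (official) answer. The issue is discussed, together with a proposed solution, in "mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models" from The R Journal (https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf), pp. 305-307. In short: "A problem with the MBHAC approach may arise in the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded when measured. In this case, ties must be broken by choosing the pair of entities that will be merged. This is often done at random, but regardless of which method is adopted for breaking ties, this choice can have important consequences because it changes the clustering of the remaining observations. Moreover, the final EM solution may depend on the ordering of the variables."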
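To make the notion of ties concrete, the base R sketch below counts how many pairs of observations share the smallest pairwise distance, for coarse (integer-valued) data and for continuous data. Euclidean distance is used here only as a stand-in for the merge criterion of the model-based hierarchical step, and both data sets are simulated; this illustrates the tie problem, not mclust's actual algorithm.

```r
set.seed(1)

# Coarse data: 20 observations on a 3 x 3 integer grid (made-up values).
coarse <- matrix(sample(1:3, 40, replace = TRUE), ncol = 2)

# Continuous data of the same shape for comparison.
smooth <- matrix(rnorm(40), ncol = 2)

# Number of pairs that attain the minimum pairwise distance.
count_ties <- function(m) {
  d <- dist(m)
  sum(d == min(d))
}

ties_coarse <- count_ties(coarse)  # many equally good first merges
ties_smooth <- count_ties(smooth)  # a single best first merge
```

When several pairs are equally good candidates for the first merge, which one gets picked can depend on incidental details such as the ordering of the data, and every later merge builds on that choice.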