Error in multiscale hierarchical clustering in R

Time: 2012-10-15 14:36:01

Tags: r cluster-analysis correlation hierarchical-clustering hclust

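I am doing hierarchical clustering with an R package called pvclust, which builds on hclust and uses bootstrapping to compute significance levels for the resulting clusters.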

Consider the following data set with 3 dimensions and 10 observations:

mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
                        "D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
                        "G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
                        "J"=c(984,98,49)))

When I use hclust alone, the clustering runs fine for both the Euclidean and the correlation-based distance:

# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")

# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")

However, when I run pvclust with each of these settings, as follows:

library(pvclust)

# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)

# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)

...I get the following errors:

  • Euclidean: Error in hclust(distance, method = method.hclust) : must have n >= 2 objects to cluster
  • Correlation: Error in cor(x, method = "pearson", use = use.cor) : supply both 'x' and 'y' or a matrix-like 'x'

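Note that the distance is computed by pvclust itself, so there is no need to compute it beforehand. Also note that the choice of hclust method (average, median, etc.) does not affect the problem.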

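When I increase the dimensionality of the data set to 4, pvclust runs fine. Why do I get these errors from pvclust with 3 dimensions or fewer, but not from hclust? And why do the errors disappear when I use a data set with 4 or more dimensions?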

1 Answer:

Answer 0: (score: 2)

At the end of the function pvclust we see the line

mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust, 
    nboot = nboot, method.dist = method.dist, use.cor = use.cor, 
    method.hclust = method.hclust, store = store, weight = weight)

Digging deeper, we find

getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust, 
    nboot, store, weight = F) 
{
    n <- nrow(data)
    size <- round(n * r, digits = 0)
    ....
            smpl <- sample(1:n, size, replace = TRUE)
            suppressWarnings(distance <- dist.pvclust(data[smpl, 
                ], method = method.dist, use.cor = use.cor))
    ....
}
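The `data[smpl, ]` subsetting above matters when `size` is 1, because selecting a single row of a matrix drops it to a plain vector. The following is a minimal sketch in base R only, without pvclust, of how such a one-row sample could lead to both error messages from the question. It assumes that the Euclidean path computes `dist(t(x))`, as the question's own hclust code does, and that the correlation path calls `cor(x)`, as the error message suggests. The names `one_row`, `msg_euclid` and `msg_cor` are made up for illustration.

```r
# Same data as in the question: 3 rows (dimensions), 10 columns (observations)
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
                        "D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
                        "G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
                        "J"=c(984,98,49)))

# A bootstrap sample of size 1 selects a single row, which drops to a vector
one_row <- mat[1, ]
is.matrix(one_row)

# Euclidean path (assumed): t() of a vector is a 1 x 10 matrix, so dist()
# sees a single object and hclust() has nothing to merge
msg_euclid <- tryCatch(
  hclust(dist(t(one_row), method = "euclidean"), method = "average"),
  error = function(e) conditionMessage(e)
)

# Correlation path (assumed): cor() rejects a bare vector without a second argument
msg_cor <- tryCatch(
  cor(one_row, method = "pearson"),
  error = function(e) conditionMessage(e)
)

print(msg_euclid)
print(msg_cor)
```

With at least two rows in the sample, `data[smpl, ]` stays a matrix and both paths receive the 10 columns as objects to cluster.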

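Also note that the parameter r of the function pvclust defaults to r=seq(.5,1.4,by=.1). In fact, we can see that this value gets changed somewhere, because the progress output shows: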

Bootstrap (r = 0.33)... 

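So we end up with size <- round(3 * 0.33, digits = 0), which is 1, and data[smpl, ] then has only 1 row, which is fewer than the 2 required. After correcting r, the call completes with a warning that is probably harmless, and also produces output: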

mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", 
                    nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
  number of items to replace is not a multiple of replacement length
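The printed r values (0.33, 0.67, ..., 1.33) are consistent with the requested r being converted to a whole number of rows and then divided by the number of rows again. The sketch below reproduces them under the assumption that the conversion is `floor(n * r)`. That rule is inferred from the output above, not taken from the pvclust source, and the variable names are illustrative only. Under the same assumption it also suggests why 4 dimensions work with the default r.

```r
# Assumption: effective bootstrap sample size is floor(n * r), where n is the
# number of rows (dimensions). Inferred from the printed r values only.
r_default <- seq(0.5, 1.4, by = 0.1)

# 3 dimensions, default r: the smallest sample has 1 row, which fails
sizes_3 <- unique(floor(3 * r_default))
r_shown_3 <- round(sizes_3 / 3, 2)

# 3 dimensions, r starting at 0.7 as in the corrected call: at least 2 rows
sizes_3_fixed <- unique(floor(3 * seq(0.7, 1.4, by = 0.1)))

# 4 dimensions, default r: the smallest sample already has 2 rows
sizes_4 <- unique(floor(4 * r_default))

print(sizes_3)        # 1 2 3 4
print(r_shown_3)      # 0.33 0.67 1.00 1.33
print(sizes_3_fixed)  # 2 3 4
print(sizes_4)        # 2 3 4 5
```

If this reading is right, any r whose smallest value satisfies floor(n * r) >= 2 should avoid both errors.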

Let me know if the results are satisfactory.