Question

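I have a mixed data set (categorical and continuous variables) and I would like to run hierarchical clustering on it using Gower distance.

My code is based on the example at https://www.r-bloggers.com/hierarchical-clustering-in-r-2/, which uses base R dist() for Euclidean distance. Since dist() cannot compute Gower distance, I tried philentropy::distance() instead, but it does not work.

Thanks for your help!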

# Data
data("mtcars")
mtcars$cyl <- as.factor(mtcars$cyl)

# Hierarchical clustering with Euclidean distance - works 
clusters <- hclust(dist(mtcars[, 1:2]))
plot(clusters)

# Hierarchical clustering with Gower distance - doesn't work
library(philentropy)
clusters <- hclust(distance(mtcars[, 1:2], method = "gower"))
plot(clusters)

Answer 1

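The error comes from the distance function itself.

I don't know whether it is intentional, but the current implementation of philentropy::distance with the "gower" method cannot handle mixed data types. Its first operation is to transpose the data.frame, which produces a character matrix, and that matrix raises a type error when it is passed to the DistMatrixWithoutUnit function.

You can try the daisy function from the cluster package instead.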
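The transposition effect described above can be reproduced with base R alone. This is a minimal sketch on a small made-up data frame (the name `mixed` and its values are illustrative, not taken from philentropy's source); it only shows what t() does to mixed columns, not what distance() does internally.

```r
# Hypothetical three-row example with one numeric and one factor column.
mixed <- data.frame(mpg = c(21, 22.8, 18.7),
                    cyl = factor(c(6, 4, 8)))

# t() on a data.frame goes through as.matrix(); because a matrix can hold
# only one type, the numeric column is coerced to character as well.
transposed <- t(mixed)

is.character(transposed)
dim(transposed)
```

Any numeric routine that receives `transposed` is therefore handed strings rather than doubles.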

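library(cluster)

# mpg is numeric, cyl is treated as a categorical variable
x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# daisy() handles mixed types; named gower_d to avoid masking base dist()
gower_d <- daisy(x, metric = "gower")

cls <- hclust(gower_d)
plot(cls)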
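To make clear what daisy() computes here, the following sketch recomputes one Gower distance by hand and compares it with the daisy() result. It assumes the usual Gower definition for this two-column case: a numeric column contributes |a - b| divided by the column range, a factor column contributes 0 for a match and 1 for a mismatch, and the contributions are averaged. The variable names are illustrative.

```r
library(cluster)

x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# Distance between the first and third car, computed by hand
mpg_part <- abs(x$mpg[1] - x$mpg[3]) / diff(range(x$mpg))  # range-scaled difference
cyl_part <- as.numeric(x$cyl[1] != x$cyl[3])               # 0 if equal, 1 otherwise
manual   <- (mpg_part + cyl_part) / 2

# The same pair taken from the daisy() dissimilarity object
from_daisy <- as.matrix(daisy(x, metric = "gower"))[1, 3]

all.equal(manual, from_daisy)
```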

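Edit: for future reference, it seems philentropy will be updated in the next version to include better type handling. From the vignette:

In future versions of philentropy I will optimize the distance() function so that internal checks for data type correctness and correct input data will take less termination time than the base dist() function.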

Answer 2

Sorry, I don't speak English well and can't explain it properly. Give this a try; the code works ;-)

library(philentropy)
clusters <- hclust(
                   as.dist(
                          distance(mtcars[, 1:2], method = "gower")))
plot(clusters)

Looks good.

Answer 3

With the gower package, you can do this very efficiently.
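The answer gives no code, so the following is only a sketch of one way this might look. It assumes the gower package is installed and relies on gower_dist(x, y) recycling the shorter argument, so that comparing a single row against the full data frame yields one distance per row. The names `pairwise` and `gower_d` are illustrative.

```r
library(gower)

x <- mtcars[, 1:2]
x$cyl <- as.factor(x$cyl)

# Column i holds the Gower distances from row i to every row of x.
pairwise <- sapply(seq_len(nrow(x)), function(i) gower_dist(x[i, ], x))
dimnames(pairwise) <- list(rownames(x), rownames(x))

# hclust() expects a "dist" object, so convert the square matrix first.
gower_d  <- as.dist(pairwise)
clusters <- hclust(gower_d)
plot(clusters)
```

This builds the full matrix one row at a time, which is fine for 32 cars; for large data sets a different strategy may be preferable.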

Answer 4

Many thanks for this great question, and thanks to everyone who provided excellent answers.

Just to resolve the issue for future readers:

# import example data
data("mtcars")
# store example subset with correct data type 
mtcars_subset <- tibble::tibble(mpg = as.numeric(as.vector(mtcars$mpg)), 
                                cyl = as.numeric(as.vector(mtcars$cyl)), 
                                disp = as.numeric(as.vector(mtcars$disp)))

# transpose the data.frame to conform to the philentropy input format
mtcars_subset <- t(mtcars_subset)

# cluster
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower")))
plot(clusters)

# With the development version on GitHub you can also specify 'use.row.names = TRUE'
clusters <- hclust(as.dist(philentropy::distance(mtcars_subset, method = "gower",
                                                 use.row.names = TRUE)))
plot(clusters)

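As you can see, the clustering now works.

The problem is that in the example data set the column cyl stores factor values rather than the double values that the philentropy::distance() function requires. Because the underlying code is written in Rcpp, inconsistent data types cause problems. As Esther correctly pointed out, I will implement a better way to check type safety in a future version.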

head(tibble::as_tibble(mtcars))

# A tibble: 6 x 11
    mpg cyl    disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21   6       160   110  3.9   2.62  16.5     0     1     4     4
2  21   6       160   110  3.9   2.88  17.0     0     1     4     4
3  22.8 4       108    93  3.85  2.32  18.6     1     1     4     1
4  21.4 6       258   110  3.08  3.22  19.4     1     0     3     1
5  18.7 8       360   175  3.15  3.44  17.0     0     0     3     2
6  18.1 6       225   105  2.76  3.46  20.2     1     0     3     1

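To overcome this limitation, I stored the columns of interest from the mtcars data set in a separate data.frame / tibble and converted every column to double values via as.numeric(as.vector(mtcars$mpg)).

The resulting subset data.frame now stores only double values, as required.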
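The as.vector() step matters when a column is a factor. A small sketch on a made-up factor (the name `cyl_factor` is illustrative) shows the difference between converting the factor directly and converting its labels:

```r
cyl_factor <- factor(c(6, 4, 8, 6))

# Direct conversion returns the internal level codes, not the original values.
codes <- as.numeric(cyl_factor)

# Going through as.vector() first converts the labels "6", "4", "8", "6".
values <- as.numeric(as.vector(cyl_factor))

codes   # 2 1 3 2
values  # 6 4 8 6
```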

mtcars_subset

# A tibble: 32 x 3
     mpg   cyl  disp
   <dbl> <dbl> <dbl>
 1  21       6  160 
 2  21       6  160 
 3  22.8     4  108 
 4  21.4     6  258 
 5  18.7     8  360 
 6  18.1     6  225 
 7  14.3     8  360 
 8  24.4     4  147.
 9  22.8     4  141.
10  19.2     6  168.
# … with 22 more rows

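Note that if you pass only two input vectors to philentropy::distance(), it returns a single distance value, and hclust() cannot compute any clusters from one value. I therefore added a third column, disp, so that the clustering can be visualized.

I hope this helps.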

Hierarchical clustering with Gower distance: hclust() and philentropy::distance()

4 answers: