Clustering a mixed data set in R

Question

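I have a mixed data set (with both factor and numeric variables) and I want to run a cluster analysis on it, so that I can look at the entries in each cluster and work out what they have in common.

I know that the distance to use for this type of data set is the Gower distance.

This is what I have done so far: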

cluster <- daisy(mydata, metric = c("euclidean", "manhattan", "gower"), 
               stand = FALSE, type = list())
try <- agnes(cluster)
plot(try, hang = -1)

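The code above gives me a dendrogram, but my data has 2000 entries, so I cannot make out the individual entries at the tips of the dendrogram. I would also like to be able to extract the clusters from the dendrogram.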

Answer 1

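Only one metric should be passed to the daisy function, rather than the vector of three used in the question. The daisy function returns a dissimilarity matrix for (mixed-type) observations.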
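As a minimal sketch of a single-metric call, the snippet below runs daisy on a tiny mixed data frame. The data frame `toy` and its columns are hypothetical and exist only for illustration; they are not part of the question's data.

```r
library(cluster)

# Hypothetical toy data: one numeric column and one factor column.
toy <- data.frame(
  height = c(150, 160, 170, 180),
  colour = factor(c("red", "red", "blue", "blue"))
)

# Pass exactly one metric; "gower" copes with mixed column types.
d <- daisy(toy, metric = "gower")

# The result is a dissimilarity object that also behaves like a "dist".
print(class(d))

# Full symmetric matrix of pairwise dissimilarities.
print(as.matrix(d))
```

Each numeric difference is divided by the column's range and each factor mismatch counts as 1, so every pairwise value ends up between 0 and 1.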

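To get the cluster labels from agnes, you can use the cutree function. See the following example, which uses the mtcars data set:

Prepare the data

The mtcars data frame stores all of its variables on a numeric scale. However, the variable descriptions make it clear that some of them should not be treated as numeric when clustering. For example, vs, the shape of the engine, should be an (unordered) factor, while the number of gears should be an ordered factor.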

# directly from the ?mtcars
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V", "S"))
  am <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

Compute the dissimilarity matrix

# daisy(), agnes() and pam() all come from the cluster package.
library(cluster)

# Compute all the pairwise dissimilarities (distances) between observations
# in the data set.
diss_mat <- daisy(mtcars2, metric = "gower")
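Before clustering, it can help to confirm how each column is typed and what daisy produced. The following is a sketch under the same recoding as above; the helper name `col_types` is introduced here for illustration only.

```r
library(cluster)

# Rebuild the recoded data frame (same recoding as in ?mtcars).
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})

diss_mat <- daisy(mtcars2, metric = "gower")

# First class of every column: shows which variables are factors,
# ordered factors, or plain numerics.
col_types <- sapply(mtcars2, function(col) class(col)[1])
print(col_types)

# The summary reports the metric and the variable types daisy used.
print(summary(diss_mat))

# Smallest and largest pairwise dissimilarity.
print(range(as.vector(diss_mat)))
```

If a column shows up as numeric when it should be categorical, recode it before calling daisy, since the column type decides how its contribution to the dissimilarity is computed.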

Cluster the dissimilarity matrix

# Computes agglomerative hierarchical clustering of the dataset.
k <- 3
agnes_clust <- agnes(x = diss_mat)
ag_clust <- cutree(agnes_clust, k)


# Clustering the dissimilarity matrix using 
# partitioning around medoids 
pam_clust <- pam(diss_mat, k)

# A comparison of the two clusterings
table(ag_clust, pam_clust = pam_clust$clustering)
#          pam_clust
# ag_clust  1  2  3
#        1  6  0  0
#        2  2 10  2
#        3  0  0 12
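To address the two points raised in the question (an unreadable dendrogram and inspecting what the members of each cluster have in common), one possible approach is sketched below. It assumes k = 3 as in the example above; the names `hc`, `members` and `num_cols` are illustrative, and the choice of summaries is only one of many reasonable options.

```r
library(cluster)

# Same recoding and dissimilarities as in the example above.
mtcars2 <- within(mtcars, {
  vs   <- factor(vs, labels = c("V", "S"))
  am   <- factor(am, labels = c("automatic", "manual"))
  cyl  <- ordered(cyl)
  gear <- ordered(gear)
  carb <- ordered(carb)
})
diss_mat <- daisy(mtcars2, metric = "gower")
agnes_clust <- agnes(x = diss_mat)

k <- 3  # assumed number of clusters, as in the example above

# Convert to an hclust object so the base R dendrogram tools apply.
hc <- as.hclust(agnes_clust)

# With thousands of rows the leaf labels are unreadable, so suppress them
# and outline the k clusters with rectangles instead.
plot(hc, labels = FALSE, hang = -1, main = "agnes on Gower dissimilarities")
rect.hclust(hc, k = k, border = 2:4)

# Cluster label for every observation, in the original row order.
ag_clust <- cutree(hc, k = k)

# Entries that belong to each cluster.
members <- split(rownames(mtcars2), ag_clust)
print(members)

# Per-cluster profiles: means of the numeric columns ...
num_cols <- sapply(mtcars2, is.numeric)
print(aggregate(mtcars2[, num_cols], by = list(cluster = ag_clust), FUN = mean))

# ... and level counts for one of the factor columns.
print(table(cluster = ag_clust, cyl = mtcars2$cyl))
```

For a data set with 2000 rows the same calls apply; only the labels are dropped from the plot, while `members` still lists every entry by cluster.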

Other packages

A few other packages for clustering mixed-type data are CluMix and FD.
