Question

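Suppose we have a 10x5 dataset: 10 wine samples (rows) with 5 chemical measurements each (e.g., var1, var2, var3, var4, var5). We want to cluster the wine samples with k-means based on the chemical measurements. That is easy enough. However, I want to cluster successively: first cluster the wine samples using each single chemical measurement, then repeat the clustering for every combination of var1, var2, var3, var4 and var5 (all singles, pairs, triples, quadruples, and the one set of all five).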

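In other words, I am interested in clustering the wine samples on every possible combination of the measurement columns, which yields 31 clustering results in total, e.g. based on (1) var1, (2) var2, (3) var3, (4) var4, (5) var5, (6) var1 and var2, (7) var1 and var3, ..., (31) var1, var2, var3, var4 and var5.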

How can I create such a loop in R?

Answer 1

Suppose you have a dataset (here, 20 rows by 5 columns of random values):

set.seed(144)
dat <- matrix(rnorm(100), ncol=5)

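Now you can get all subsets of the columns (each represented as a logical vector indicating whether to keep each column), dropping the first row (the all-FALSE row, which would exclude every column).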

(cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1,])
#     Var1  Var2  Var3  Var4  Var5
# 2   TRUE FALSE FALSE FALSE FALSE
# 3  FALSE  TRUE FALSE FALSE FALSE
# 4   TRUE  TRUE FALSE FALSE FALSE
# ...
# 31 FALSE  TRUE  TRUE  TRUE  TRUE
# 32  TRUE  TRUE  TRUE  TRUE  TRUE
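
As a quick sanity check on the grid above, the sketch below confirms that dropping the all-FALSE row leaves 2^5 - 1 = 31 subsets, and that they split by size into 5, 10, 10, 5 and 1. It hard-codes 5 columns as an assumption, to match the example data.

```r
# Rebuild the subset grid for 5 columns (assumed, to match the example)
cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), 5))[-1, ]

# Number of non-empty subsets: 2^5 - 1
n_subsets <- nrow(cols)

# How many subsets keep 1, 2, 3, 4 or 5 columns
size_counts <- table(rowSums(as.matrix(cols)))

n_subsets     # 31
size_counts   # counts 5 10 10 5 1 for sizes 1..5
```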

The final step is to run k-means clustering for each column subset, which is a simple application of apply (I assume you want 3 clusters in each model):

mods <- apply(cols, 1, function(x) kmeans(dat[,x], 3))

You can access each of the 31 k-means models using list indexing. For example:

mods[[1]]
# K-means clustering with 3 clusters of sizes 7, 5, 8
# 
# Cluster means:
#         [,1]
# 1 -1.4039782
# 2 -0.4215221
# 3  0.3227336
# 
# Clustering vector:
#  [1] 1 3 2 1 1 3 3 1 3 3 2 3 2 1 3 3 2 1 1 2
# 
# Within cluster sum of squares by cluster:
# [1] 0.4061644 0.1438443 0.7054191
#  (between_SS / total_SS =  89.9 %)
# 
# Available components:
# 
# [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
# [7] "size"         "iter"         "ifault"
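
Numeric indices such as mods[[17]] do not say which variables a model used. The following is a minimal sketch of one way to label each model by its variable combination and to collect the cluster assignments side by side. The column names var1 to var5, the "+" separator, and nstart = 10 are assumptions made for illustration; they are not part of the answer above.

```r
set.seed(144)
dat <- matrix(rnorm(100), ncol = 5)
colnames(dat) <- paste0("var", 1:5)   # hypothetical column names

# Logical matrix: one row per non-empty column subset
cols <- as.matrix(do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1, ])

# Label each subset by the variables it keeps, e.g. "var1+var3"
labels <- unname(apply(cols, 1, function(x) paste(colnames(dat)[x], collapse = "+")))

# Fit one model per subset; drop = FALSE keeps single columns as matrices,
# nstart = 10 (an assumed value) tries several random starts per model
mods <- lapply(seq_len(nrow(cols)), function(i) {
  kmeans(dat[, cols[i, ], drop = FALSE], centers = 3, nstart = 10)
})
names(mods) <- labels

# One column of cluster assignments per variable subset
assignments <- sapply(mods, function(m) m$cluster)
dim(assignments)   # 20 31
```

With the names in place, a model can be looked up as mods[["var1+var2"]], and assignments can be scanned column by column to compare how the samples group under different subsets. Cluster numbers are arbitrary labels, so the same grouping may appear with different numbers in different columns.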

Answer 2

# create a dummy matrix
dummy <- matrix(rnorm(10 * 5), 10, 5)

# create all the combinations of variables
combos <- lapply(1:5, function(x) t(combn(1:5, x)))    

# loop over the combination sets and fit a k-means with 2 clusters to each
kms <- lapply(combos, function(x) {
  lapply(1:nrow(x), function(y) {
    kmeans(dummy[,x[y,]], 2)
  })
})

> sapply(kms, length)
[1]  5 10 10  5  1

# access the results like so:
> kms[[1]][[1]]
K-means clustering with 2 clusters of sizes 3, 7
...
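
If a single flat list of 31 models is easier to work with than a list nested by subset size, the same idea can be sketched with combn(..., simplify = FALSE). The column names, the "+" labels, the seed, and nstart = 10 below are assumptions for illustration only.

```r
set.seed(1)   # assumed seed, only for reproducibility of the sketch
dummy <- matrix(rnorm(10 * 5), 10, 5)
colnames(dummy) <- paste0("var", 1:5)   # hypothetical column names

# All non-empty column subsets as one flat list of index vectors
subsets <- unlist(
  lapply(seq_len(ncol(dummy)), function(k) combn(ncol(dummy), k, simplify = FALSE)),
  recursive = FALSE
)

# Fit a 2-cluster k-means to each subset
kms_flat <- lapply(subsets, function(idx) {
  kmeans(dummy[, idx, drop = FALSE], centers = 2, nstart = 10)
})

# Name each model after the variables it used, e.g. "var1+var2"
names(kms_flat) <- vapply(
  subsets,
  function(idx) paste(colnames(dummy)[idx], collapse = "+"),
  character(1)
)

length(kms_flat)   # 31
```

The subsets come out ordered by size (singles first, then pairs, and so on), which matches the numbering used in the question.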

R: Successive k-means clustering operations in R
