Question

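This post shows a way to initialize the centers for the K-means algorithm in R. However, the data used there is scalar (i.e. single numbers).

A variation on that question: what if the data has multiple dimensions? In that case the initial centers should be vectors, so start should be a vector of vectors... I tried something like: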

C1<- c(1,2)
C2<- c(4,-5)

to define my two initial centers, and then called

kmeans(dat, c(C1,C2))

But it doesn't work. I also tried cbind() instead of c(). Same result...

Answer 1

## Your centers
C1 <- c(1, 2)
C2 <- c(4, -5)

## Simulate some data with groups around these centers
library(MASS)
set.seed(0)
dat <- rbind(mvrnorm(100, mu=C1, Sigma = matrix(c(2,3,3,10), 2)),
             mvrnorm(100, mu=C2, Sigma = matrix(c(10,3,3,2), 2)))

clusts <- kmeans(dat, rbind(C1, C2))  # get clusters with your center starting points

## Look at them
plot(dat, col=clusts$cluster)

[Figure: scatter plot of dat, with points colored by assigned cluster]
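
To see why c() and cbind() do not give the intended starting points, it can help to compare the shapes of the three objects. This is only a sketch: the names flat, by_col and by_row are made up for illustration, and it assumes (as the rbind() call above suggests) that kmeans() reads each row of the centers matrix as one center.

```r
C1 <- c(1, 2)
C2 <- c(4, -5)

flat   <- c(C1, C2)      # plain vector of length 4, no row/column structure
by_col <- cbind(C1, C2)  # 2 x 2 matrix, but each center is a COLUMN
by_row <- rbind(C1, C2)  # 2 x 2 matrix, each center is a ROW

dim(flat)    # NULL: c() has flattened the two centers into one vector
by_col       # rows are (1, 4) and (2, -5), not the centers we wanted
by_row       # rows are (1, 2) and (4, -5), i.e. C1 and C2

# Transposing the cbind() result gives the same layout as rbind()
all(t(by_col) == by_row)
```

So if the centers are already stored column-wise, t() should be enough to get them into the row-per-center layout.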

Answer 2

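You extend the matrix start to have cluster rows and variable columns (dimensions), where cluster is the number of clusters you are trying to identify and variable is the number of variables in the data set.

Here is an extension of the post you linked to, taking the example to 3 dimensions (variables), x, y, and z: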

set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0 , 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
plot(dat)

The resulting plot is a scatterplot matrix of x, y and z:

[Figure: scatterplot matrix produced by plot(dat)]

Now we need to specify a cluster centre for each of the three clusters. As before, this is done via a matrix:

start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

> start
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4

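The important thing to note here is that the clusters are in rows. The columns hold the coordinates of each cluster centre on that dimension. Hence for cluster 1 we are specifying that the centroid is at (-5, -5, -5).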
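
The row/column convention can be checked before calling kmeans(). The following is a minimal sketch that rebuilds dat and start as above; the column names put on start are optional labels added here for readability, not something the original example uses.

```r
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))

start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

# Optional: label the columns so each coordinate is tied to a variable
colnames(start) <- names(dat)

# One column per variable in the data, one row per cluster
stopifnot(ncol(start) == ncol(dat))
nrow(start)   # number of clusters being requested

start[1, ]    # centre of cluster 1: x = -5, y = -5, z = -5
```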

Calling kmeans()

kmeans(dat, start)

results in it picking groups very close to our initial starting points (as it should for this example):

> kmeans(dat, start)
K-means clustering with 3 clusters of sizes 33, 33, 33

Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120

Clustering vector:
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[39] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[77] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3

Within cluster sum of squares by cluster:
[1] 117.78043  77.65203  77.00541
 (between_SS / total_SS =  93.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

What is worth noting here is the output for the cluster centres:

Cluster means:
           x           y         z
1 -4.8371412 -4.98259934 -4.953537
2  0.2106241  0.07808787  2.073369
3  4.9708243  4.77465974 -4.047120

This layout is exactly the same as that of the matrix start.
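
The match between the two layouts can also be checked programmatically. A small sketch, again rebuilding dat and start from the example; fit is just an illustrative name for the object returned by kmeans().

```r
set.seed(1)
dat <- data.frame(x = rnorm(99, mean = c(-5, 0, 5)),
                  y = rnorm(99, mean = c(-5, 0, 5)),
                  z = rnorm(99, mean = c(-5, 2, -4)))
start <- matrix(c(-5, 0, 5, -5, 0, 5, -5, 2, -4), nrow = 3, ncol = 3)

fit <- kmeans(dat, start)

# Fitted centres come back with the same shape as the starting matrix:
# one row per cluster, one column per variable
dim(fit$centers)
dim(start)

# How far each fitted centre moved from its starting point, per coordinate
round(fit$centers - start, 2)
```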

You don't have to build the matrix directly using matrix(), nor do you have to specify the centres column-wise. For example:

c1 <- c(-5, -5, -5)
c2 <- c( 0,  0,  2)
c3 <- c( 5,  5, -4)
start2 <- rbind(c1, c2, c3)

> start2
   [,1] [,2] [,3]
c1   -5   -5   -5
c2    0    0    2
c3    5    5   -4

or

start3 <- matrix(c(-5, -5, -5,
                    0,  0,  2,
                    5,   5, -4), ncol = 3, nrow = 3, byrow = TRUE)

> start3
     [,1] [,2] [,3]
[1,]   -5   -5   -5
[2,]    0    0    2
[3,]    5    5   -4

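if you are more comfortable with either of those.

The key thing to remember is that variables are in columns and cluster centres are in rows.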
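
Since the question talks about a "vector of vectors", one more option is worth sketching: keep the centres in a list and stack them with do.call(rbind, ...). The names centres and start4 are hypothetical, chosen only to continue the numbering above.

```r
# Hypothetical list holding one numeric vector per cluster centre
centres <- list(c1 = c(-5, -5, -5),
                c2 = c( 0,  0,  2),
                c3 = c( 5,  5, -4))

# Stack the vectors row-wise: one row per centre, one column per variable
start4 <- do.call(rbind, centres)

start4
dim(start4)
```

This should scale to any number of centres without writing each one out in the rbind() call, and start4 can then be passed to kmeans() in the same way as start.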

Initializing kmeans with *vector* initial centroids in R
