Question

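I am a student of clustering and R. To get a better grip on both, I would like to compute the distance between the centroids and my xy-matrix at each iteration until it "converges". How can I solve steps 2 and 3 using R?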

library(fields)
x <- c(3,6,8,1,2,2,6,6,7,7,8,8)
y <- c(5,2,3,5,4,6,1,8,3,6,1,7)

df <- data.frame(x,y) # initial matrix
a  <- c(3,6,8)
b  <- c(5,2,3)

df1 <- data.frame(a,b) # initial centroids

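Here is what I want to do:

1. I0 <- t(rdist(df, df1))
2. Cluster the objects based on minimum distance
3. Determine the centroids based on the cluster means
4. Repeat with I1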

I tried the kmeans function, but for some reason it returns the centroids that should only come up at the end. This is how I defined the start:

start   <- matrix(c(3,5,6,2,8,3), 3, byrow = TRUE)
cluster <- kmeans(df,centers = start, iter.max = 1) # one iteration

kmeans does not let me track the movement of the centroids. Therefore I would like to do it "manually" by applying steps 2 and 3 using R.

Answer 1

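Your main question seems to be how to calculate the distances between a data matrix and some set of points ("centers").

For this you can write a function that takes a data matrix and your set of points as input and returns the distance from each row (point) of the data matrix to every "center".

Here is such a function: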

myEuclid <- function(points1, points2) {
    distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
    for(i in 1:nrow(points2)) {
        distanceMatrix[,i] <- sqrt(rowSums(t(t(points1)-points2[i,])^2))
    }
    distanceMatrix
}

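points1 is the data matrix with points as rows and dimensions as columns. points2 is the matrix of centers (again, points as rows). The first line of code just defines the result matrix (it has as many rows as the data matrix and as many columns as there are centers). So entry i,j of the result matrix is the distance from the ith point to the jth center.

The for loop then iterates over all centers. For each center it computes the Euclidean distance from every point to the current center and stores the result in the corresponding column. The line sqrt(rowSums(t(t(points1)-points2[i,])^2)) is the Euclidean distance. Work through it and check it against the formula if you have any trouble with it. (The transposing is mainly there to make sure the subtraction is done row-wise.)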
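To see why the double transpose is needed, here is a small sketch on made-up numbers (the names pts and center are illustrative and not part of the answer). It compares a plain subtraction, the transpose trick, and base R's sweep(), which is an alternative way to state the same intent.

```r
# Three 2-d points stored as rows, and one center
pts <- matrix(c(1, 2,
                3, 4,
                5, 6), ncol = 2, byrow = TRUE)
center <- c(1, 2)

# Plain subtraction recycles `center` down the columns,
# so x and y coordinates get mixed up
naive <- pts - center

# Transposing first lines `center` up with each point;
# the outer t() restores the points-as-rows layout
rowwise <- t(t(pts) - center)

# sweep() subtracts `center` from every row directly
swept <- sweep(pts, 2, center)

# Euclidean distance from each point to `center`
distances <- sqrt(rowSums(rowwise^2))
print(distances)
```

The first point coincides with the center, so its distance is zero, which is a quick sanity check on the row-wise version.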

Now you can also implement the k-means algorithm:

myKmeans <- function(x, centers, distFun, nItter=10) {
    clusterHistory <- vector(nItter, mode="list")
    centerHistory <- vector(nItter, mode="list")

    for(i in 1:nItter) {
        distsToCenters <- distFun(x, centers)
        clusters <- apply(distsToCenters, 1, which.min)
        centers <- apply(x, 2, tapply, clusters, mean)
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}

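As you can see, it is also a very simple function: it takes the data matrix, the centers, the distance function (the one defined above), and the number of iterations wanted.

The clusters are defined by assigning each point to its closest center, and each center is then updated to the mean of the points assigned to it. This is the basic k-means algorithm.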
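As a worked illustration of a single assignment-and-update pass, the sketch below runs one iteration by hand on the data and starting centroids from the question. The variable names (pts, d, cl, newCenters) are chosen here for illustration, and sweep() is used in place of the transpose trick.

```r
# Data and initial centroids from the question
x <- c(3, 6, 8, 1, 2, 2, 6, 6, 7, 7, 8, 8)
y <- c(5, 2, 3, 5, 4, 6, 1, 8, 3, 6, 1, 7)
pts <- cbind(x, y)
centers <- rbind(c(3, 5), c(6, 2), c(8, 3))

# Step 1: distance from every point (row) to every centroid (column)
d <- sapply(seq_len(nrow(centers)), function(k) {
  sqrt(rowSums(sweep(pts, 2, centers[k, ])^2))
})

# Step 2: each point joins the cluster of its nearest centroid
cl <- apply(d, 1, which.min)

# Step 3: each new centroid is the mean of the points in its cluster
newCenters <- apply(pts, 2, tapply, cl, mean)

print(cl)
print(newCenters)
```

Feeding newCenters back in as centers and repeating the three steps gives the next iteration.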

Let's try it out. Define some random points (in 2-d, so the number of columns is 2):

mat <- matrix(rnorm(100), ncol=2)

Pick 5 random points from that matrix as the initial centers:

centers <- mat[sample(nrow(mat), 5),]

Now run the algorithm:

theResult <- myKmeans(mat, centers, myEuclid, 10)

Here are the centers at the 10th iteration:

theResult$centers[[10]]
        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Compare that with the built-in kmeans function:

theResult2 <- kmeans(mat, centers, 10, algorithm="Forgy")

theResult2$centers
        [,1]        [,2]
1 -0.1343239  1.27925285
2 -0.8004432 -0.77838017
3  0.1956119 -0.19193849
4  0.3886721 -1.80298698
5  1.3640693 -0.04091114

Works fine. However, our function also tracks the iterations. We can plot the progress over the first 4 iterations like this:

par(mfrow=c(2,2))
for(i in 1:4) {
    plot(mat, col=theResult$clusters[[i]], main=paste("iteration:", i), xlab="x", ylab="y")
    points(theResult$centers[[i]], cex=3, pch=19, col=1:nrow(theResult$centers[[i]]))
}

[Figure: k-means clusters and centers over the first 4 iterations]

Nice.
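Because the centers of every iteration are stored, the history can also be used to see when the algorithm stops moving, which is the "convergence" the question asks about. The helper below is a hypothetical sketch (firstStable and its tol argument are names invented here, not part of the answer): it returns the first iteration whose centers match those of the previous iteration, or NA if that never happens. It is shown on a small hand-made history; with the objects above it could be called as firstStable(theResult$centers).

```r
# Hypothetical helper: first iteration whose centers equal the previous ones
firstStable <- function(centerHistory, tol = 1e-8) {
  for (i in seq_along(centerHistory)[-1]) {
    prev <- centerHistory[[i - 1]]
    cur <- centerHistory[[i]]
    # Compare only if the number of centers did not change
    if (identical(dim(prev), dim(cur)) && max(abs(cur - prev)) < tol) {
      return(i)
    }
  }
  NA_integer_
}

# Toy history: the centers move once, then stay put
toyHistory <- list(
  matrix(c(0, 0, 1, 1), ncol = 2),
  matrix(c(0.5, 0, 1, 1.5), ncol = 2),
  matrix(c(0.5, 0, 1, 1.5), ncol = 2)
)

print(firstStable(toyHistory))
```

A tolerance is used instead of an exact comparison because the centers are floating-point means.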

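However, this simple design allows for more. For example, if we want to use another kind of distance (not Euclidean), we can use any function that takes the data and the centers as inputs. Here is a correlation distance: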

myCor <- function(points1, points2) {
    return(1 - ((cor(t(points1), t(points2))+1)/2))
}

Then we can run k-means based on it:

theResult <- myKmeans(mat, centers, myCor, 10)

The resulting picture for 4 iterations looks like this:

[Figure: clusters over the first 4 iterations using the correlation distance]

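Even though you specified 5 clusters, only 2 remain at the end. This is because in 2 dimensions the correlation can only take the values +1 or -1. When the clusters are assigned, each point goes to exactly one center even if it is equally distant from several of them: the first one gets chosen.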
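The +1 or -1 behaviour can be checked directly. In this small sketch the vectors are made-up 2-d points (the names rUp, rDown, dUp, dDown are illustrative): with only two coordinates, the correlation just records whether both points increase or decrease in the same direction, so the distance from myCor's formula collapses to 0 or 1.

```r
# Two coordinates each: both vectors increase from first to second entry
rUp <- cor(c(1, 2), c(3, 5))

# Here the second vector decreases while the first increases
rDown <- cor(c(1, 2), c(5, 3))

# Distances using the same formula as myCor
dUp <- 1 - (rUp + 1) / 2
dDown <- 1 - (rDown + 1) / 2

print(c(rUp, rDown))
print(c(dUp, dDown))
```

Note that a point whose two coordinates are equal has zero variance, so cor() returns NA for it with a warning; that case is not handled here.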

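Anyway, this is now out of scope. The bottom line is that there are many possible distance metrics, and one simple function lets you use any distance you want while tracking the results over the iterations.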

Answer 2

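I modified the distance matrix function above (adding another loop over the points), because the function above only showed the distance from the first point to all clusters rather than from all points, which is what the question asks for: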

myEuclid <- function(points1, points2) {
    distanceMatrix <- matrix(NA, nrow=dim(points1)[1], ncol=dim(points2)[1])
    for(i in 1:nrow(points2)) {
        for (j in 1:nrow(points1)) {
            # unlist() lets this work for both matrix rows and data frame rows
            distanceMatrix[j,i] <- sqrt(sum((unlist(points1[j,]) - unlist(points2[i,]))^2))
        }
    }
    distanceMatrix
}

Please let me know if it works!
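One way to check a hand-written distance function is to compare it with base R's dist(). The sketch below is illustrative only: pairDist is a hypothetical stand-in written here with an explicit double loop, and the data are the points and initial centroids from the question. dist() works on a single matrix, so the points and centroids are stacked and only the point-by-centroid block is kept.

```r
# Hypothetical double-loop distance function, for illustration only
pairDist <- function(points1, points2) {
  points1 <- as.matrix(points1)
  points2 <- as.matrix(points2)
  out <- matrix(NA_real_, nrow = nrow(points1), ncol = nrow(points2))
  for (i in seq_len(nrow(points1))) {
    for (j in seq_len(nrow(points2))) {
      out[i, j] <- sqrt(sum((points1[i, ] - points2[j, ])^2))
    }
  }
  out
}

# Points and initial centroids from the question
df  <- data.frame(x = c(3, 6, 8, 1, 2, 2, 6, 6, 7, 7, 8, 8),
                  y = c(5, 2, 3, 5, 4, 6, 1, 8, 3, 6, 1, 7))
df1 <- data.frame(a = c(3, 6, 8), b = c(5, 2, 3))

# Reference: dist() on the stacked rows, keeping the point-by-centroid block
n <- nrow(df)
k <- nrow(df1)
stacked <- rbind(as.matrix(df), as.matrix(df1))
ref <- as.matrix(dist(stacked))[seq_len(n), n + seq_len(k)]

print(all.equal(pairDist(df, df1), unname(ref)))
```

The same comparison can be run against any other candidate function by swapping it in for pairDist.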

How to compute the distances between centroids and a data matrix (for the k-means algorithm)
