Question

我正在使用R中的K-Means算法，我想弄清楚4个算法Lloyd，Forgy，MacQueen和Hartigan-Wong的差异，它们可用于stats包中的“kmeans”功能。

但是我很明显能够对这个问题给出足够的答案。

我只发现了一些很少的信息：（访问http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/K-Means）

从这个描述中，Lloyd，Forgy和Hartigan-Wong对我来说似乎是一样的。最小化平方和或最小化欧氏距离是相同的。

如果我正确的话，如果一个对象被移动到另一个集群，则MacQueen会更新两个相关的集群。

尽管如此，我仍然没有看到这些算法在哪些方面有所不同。

Answer 1

R 提供Lloyd算法作为kmeans（）的选项;默认算法，by Hartigan和Wong（1979）更聪明。像MacQueen的算法（MacQueen，1967），它会在移动一个点时更新质心;它还可以做出聪明（省时）的选择检查最近的集群。另一方面，Lloyd的k-means算法是所有这些聚类算法中第一个也是最简单的算法。

Lloyd的算法（Lloyd，1957）采用了一系列观察或案例（想想：行数 nxp矩阵或Reals中的点）并将它们聚合到k组中。它试图最小化群内平方和 enter image description here

其中u_i是群集S_i中所有点的平均值。算法如下进行（我会饶恕详尽的符号的形式）： enter image description here

然而，R的实现存在问题，并且在出现问题时会出现问题考虑多个起点。我应该注意，考虑一般是谨慎的几个不同的起点，因为算法保证收敛，但不是保证覆盖全球最佳状态。对于大尺寸，高维度尤其如此问题。我将从一个简单的例子开始（大而不是特别困难）。

（这里我将粘贴一些图像，因为我们不能用乳胶写出数学公式）

enter image description here

请注意，该解决方案与之前实现的解决方案非常相似，尽管排序方式不同集群是任意的。更重要的是，这项工作仅需0.199秒并行！一定这实在是太好了：使用3个处理器内核最多应占据三分之一我们第一次（顺序）运行的时间。这是一个问题吗？这听起来像免费午餐。没有有一段时间免费午餐的问题，是吗？

enter image description here

这并不总是适用于R功能，但有时我们有机会直接查看在代码。这是其中一次。我要把这段代码放到文件中，mykmeans.R，并手动编辑它，在各个地方插入cat（）语句。这是一个聪明的方法这个，使用sink（）（虽然这似乎不适用于Sweave，但它会以交互方式工作）：

> sink("mykmeans.R")
> kmeans
> sink()

现在编辑文件，更改函数名称并添加cat（）语句。注意你还必须删除一个尾随行::

enter image description here

然后我们可以重复我们的探索，但是使用mykmeans（）：

> source("mykmeans.R")
> start.kmeans <- proc.time()[3]
> ans.kmeans <- mykmeans(x, 4, nstart = 3, iter.max = 10, algorithm = "Lloyd")
JJJ statement 1: 0 elapsed time.
JJJ statement 5: 2.424 elapsed time.
JJJ statement 6: 2.425 elapsed time.
JJJ statement 7: 2.52 elapsed time.
JJJ statement 6: 2.52 elapsed time.
JJJ statement 7: 2.563 elapsed time.

enter image description here

现在我们正在开展业务：大部分时间都是在声明5之前消费的（我知道这一点当然，这就是为什么声明5是5而不是2）... 你可以继续玩它

这是代码：

#######################################################################
# kmeans()

N <- 100000
x <- matrix(0, N, 2)
x[seq(1,N,by=4),] <- rnorm(N/2)
x[seq(2,N,by=4),] <- rnorm(N/2, 3, 1)
x[seq(3,N,by=4),] <- rnorm(N/2, -3, 1)
x[seq(4,N,by=4),1] <- rnorm(N/4, 2, 1)
x[seq(4,N,by=4),2] <- rnorm(N/4, -2.5, 1)
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

these <- sample(1:nrow(x), 10000)
plot(x[these,1], x[these,2], pch=".")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

library(foreach)
library(doMC)
registerDoMC(3)
start.kmeans <- proc.time()[3]
ans.kmeans.par <- foreach(i=1:3) %dopar% {
  return(kmeans(x, 4, nstart=1, iter.max=10, algorithm="Lloyd"))
}
TSS <- sapply(ans.kmeans.par, function(a) return(sum(a$withinss)))
ans.kmeans.par <- ans.kmeans.par[[which.min(TSS)]]
ans.kmeans.par$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

sink("mykmeans.Rfake")
kmeans
sink()

source("mykmeans.R")
start.kmeans <- proc.time()[3]
ans.kmeans <- mykmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans

#######################################################################
# Diving

x <- read.csv("Diving2000.csv", header=TRUE, as.is=TRUE)
library(YaleToolkit)
whatis(x)

x[1:14,c(3,6:9)]

meancol <- function(scores) {
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}
x$panelmean <- meancol(x$JScore)

x[1:14,c(3,6:9,11)]

meancol <- function(scores) {
  browser()
  temp <- matrix(scores, length(scores)/7, ncol=7)
  means <- apply(temp, 1, mean)
  ans <- rep(means,7)
  return(ans)
}

x$panelmean <- meancol(x$JScore)

以下是描述：

Number of cases: 10,787 scores from 1,541 dives (7 judges score each
dive) performed in four events at the 2000 Olympic Games in Sydney,
Australia.

Number of variables: 10.

Description: A full description and analysis is available in an
article in The American Statistician (publication details to be
announced).

Variables:

Event       Four events, men's and women's 3M and 10m.
Round       Preliminary, semifinal, and final rounds.
Diver       The name of the diver.
Country     The country of the diver.
Rank        The final rank of the diver in the event.
DiveNo      The number of the dive in sequence within round.
Difficulty  The degree of difficulty of the dive.
JScore      The score provided for the judge on this dive.
Judge       The name of the judge.
JCountry    The country of the judge.

用数据集来试验https://www.dropbox.com/s/urgzagv0a22114n/Diving2000.csv

Answer 2

转到第16页-
这清楚地解释了Gorgy / Llyods，MacQueen，Hatigan-Wong算法。

https://core.ac.uk/download/pdf/27210461.pdf

K-Means：Lloyd，Forgy，MacQueen，Hartigan-Wong

2 个答案: