Question

I am trying to understand what stats::kmeans does differently from the simple version explained on Wikipedia. Honestly, I am completely clueless.

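After reading the help page for kmeans, I learned that the default algorithm is Hartigan-Wong, not the more basic method, so there should be a difference. But when playing around with some normally distributed variables, I could not find a case where the two differed substantially and predictably.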
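
One way to look for such a difference is to run both algorithms from identical starting centers, so that only the update rule differs. The following is a minimal sketch under assumptions that are not part of the question: the simulated data, the three means, and the seed are arbitrary illustrative choices. It relies only on the `centers`, `iter.max`, and `algorithm` arguments of `kmeans`.

```r
set.seed(42)  # arbitrary seed, only for reproducibility (assumption)

# Two-dimensional normally distributed data around three arbitrary means
xy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Use the same starting centers for both runs so only the algorithm differs
start <- xy[sample(nrow(xy), 3), ]

fit_hw    <- kmeans(xy, centers = start, iter.max = 100,
                    algorithm = "Hartigan-Wong")
fit_lloyd <- kmeans(xy, centers = start, iter.max = 100,
                    algorithm = "Lloyd")

# Compare the objective (total within-cluster sum of squares) ...
print(c(hartigan_wong = fit_hw$tot.withinss, lloyd = fit_lloyd$tot.withinss))

# ... and cross-tabulate the two partitions
print(table(hartigan_wong = fit_hw$cluster, lloyd = fit_lloyd$cluster))
```

On overlapping Gaussian blobs like these, the two partitions may well turn out nearly identical, which would match the observation above. Repeating the run with different seeds or starting centers is a cheap way to hunt for a case where they diverge.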

For reference, this is the utterly horrible code I tested it with:

## square of the Euclidean metric
my_metric <- function(x=vector(),y=vector()) {
  stopifnot(length(x)==length(y))
  sum((x-y)^2)
}

## data: xy data
## k: number of groups
my_kmeans <- function(data, k, maxIt=10) {

  ##get length and check if data lengths are equal and if enough data is provided
  l<-length(data[,1])
  stopifnot(l==length(data[,2]))
  stopifnot(l>k)

  ## generate the starting points
  ms <- data[sample(1:l,k),]

  ##append the data with g column and initialize last
  data$g<-0
  last <- data$g

  it<-0
  repeat{
    it<-it+1
    ##iterate through each data point and assign to cluster
    for(i in 1:l){
      ## one slot per cluster (was hard-coded to three)
      distances <- rep(Inf, k)
      for(j in 1:k){
        distances[j]<-my_metric(data[i,c(1,2)],ms[j,])
      }
      data$g[i] <- which.min(distances)

    }

    ##update cluster points
    for(i in 1:k){
      points_in_cluster <- data[data$g==i,1:2]
      ms[i,] <- c(mean(points_in_cluster[,1]),mean(points_in_cluster[,2]))
    }

    ##break condition: nothing changed, or maxIt iterations reached
    if(my_metric(last,data$g)==0 || it >= maxIt){
      break
    }
    last<-data$g
  }

  data
}

Answer 1

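First of all, this is a duplicate of this post (as I just found out). I will still try to give an example, though: when the clusters are well separated, Lloyd's algorithm tends to leave each center inside the cluster it started in, which means some clusters may end up split while others are merged together.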
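
To make that concrete, here is a small sketch on hand-made data. Everything in it is an illustrative assumption rather than part of the original post: the three blobs, the helper names `make_blob` and `wss`, and the deliberately bad start with two centers in the first blob, one in the second, and none in the third.

```r
# Hypothetical helper: four points forming a small rectangle around x = cx
make_blob <- function(cx) {
  cbind(x = c(cx - 1, cx - 1, cx + 1, cx + 1), y = c(0, 1, 0, 1))
}

# Three well-separated blobs and the grouping one would expect
pts   <- rbind(make_blob(0), make_blob(10), make_blob(20))
truth <- rep(1:3, each = 4)

# Bad start: two centers inside blob 1, one inside blob 2, none in blob 3
start <- pts[c(1, 3, 5), ]

fit <- kmeans(pts, centers = start, iter.max = 100, algorithm = "Lloyd")

# Hypothetical helper: total within-cluster sum of squares for a grouping
wss <- function(p, g) {
  sum(sapply(unique(g), function(i) {
    q <- p[g == i, , drop = FALSE]
    sum(sweep(q, 2, colMeans(q))^2)
  }))
}

print(table(truth = truth, lloyd = fit$cluster))
print(fit$size)           # 2 2 8: blob 1 is split, blobs 2 and 3 are merged
print(fit$tot.withinss)   # 211
print(wss(pts, truth))    # 15 for the expected grouping
```

Neither of the two centers that start in the first blob ever leaves it, so the third center has to cover the other two blobs on its own. Swapping `algorithm = "Hartigan-Wong"` into the same call is an easy way to check how the default behaves from this start. No outcome is claimed for it here.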

What is the difference between stats::kmeans and "naive" k-means