Regrouping clusters around fixed centroids

Date: 2015-10-28 18:36:28

Tags: r classification cluster-analysis data-mining

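Clustering/classification problem: I used k-means clustering to generate these clusters and centroids.

This is the dataset, with the cluster attribute from the initial run added: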

> dput(sampledata)
structure(list(Player = structure(1:5, .Label = c("A", "B", "C", 
"D", "E"), class = "factor"), Metric.1 = c(0.3938961, 0.28062338, 
0.32532626, 0.29239642, 0.25622558), Metric.2 = c(0.00763359, 
0.01172354, 0.40550867, 0.04026846, 0.05976367), Metric.3 = c(0.50766075, 
0.20345662, 0.06267444, 0.08661417, 0.17588925), cluster = c(1L, 
2L, 3L, 2L, 2L)), .Names = c("Player", "Metric.1", "Metric.2", 
"Metric.3", "cluster"), row.names = c(NA, -5L), class = "data.frame")

These are the cluster details for the 3 metrics:

> dput (scluster)
structure(list(cluster = c(1L, 2L, 3L, 2L, 2L), centers = structure(c(0.3938961, 
0.276415126666667, 0.32532626, 0.00763359, 0.03725189, 0.40550867, 
0.50766075, 0.155320013333333, 0.06267444), .Dim = c(3L, 3L), .Dimnames = list(
    c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))), 
    totss = 0.252759813332907, withinss = c(0, 0.00930902482096013, 
    0), tot.withinss = 0.00930902482096013, betweenss = 0.243450788511947, 
    size = c(1L, 3L, 1L), iter = 1L, ifault = 0L), .Names = c("cluster", 
"centers", "totss", "withinss", "tot.withinss", "betweenss", 
"size", "iter", "ifault"), class = "kmeans")

Data with cluster attribute and centroids

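My goal is to find a way to fix these centroids after the first clustering run, so that they can serve as a fixed reference in the future. I want to see how the players move into and out of these clusters as their metrics change, and so track their progress or regression.

Specifically, if player A's metrics change so that A now resembles cluster 2 rather than cluster 1, based on the Euclidean distance to each fixed centroid, I should be able to see player A move to cluster 2. In other words, the data points would be regrouped around the centroids that were fixed after the first run.

This should also help other users see how to approach a data-mining problem like this. Any pointers would be much appreciated! Thanks.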

1 answer:

Answer 0: (score: 2)

Here you go:

# install a couple of packages needed for the example
library(devtools)
devtools::install_github("alexwhitworth/emclustr")
devtools::install_github("alexwhitworth/imputation")
library(emclustr)
library(imputation)

# generate some example data -- 30 points in 3 2-dimensional clusters
# clusters are MVN
set.seed(123)
x <- rbind(gen_clust(10, 2, c(-5,5), c(1,1)),
           gen_clust(10, 2, c(0,0), c(1,1)),
           gen_clust(10, 2, c(5,5), c(1,1)))

# get initial centroids
km <- kmeans(x, centers= 3)$centers

# generate a new set of example data, in this case a "subsequent step"
# from your time-series
x2 <- rbind(gen_clust(10, 2, c(-4,-4), c(1,1)),
           gen_clust(10, 2, c(1,1), c(1,1)),
           gen_clust(10, 2, c(4,4), c(1,1)))

# calculate the Euclidean distance of each point to each centroid
# and evaluate nearest distance
d_km <- as.data.frame(cbind(dist_q.matrix(x= rbind(km[1,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[2,], x2), ref= 1L, q=2),
              dist_q.matrix(x= rbind(km[3,], x2), ref= 1L, q=2)))
names(d_km) <- c("dist_centroid1", "dist_centroid2", "dist_centroid3")
d_km$clust <- apply(d_km, 1, which.min)

# plot the centroids and the new points "x2" to show the results
plot(km, pch= 11, xlim= c(-6,6), ylim= c(-6,6))
points(x2, col= factor(d_km$clust))

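The same nearest-centroid assignment can also be sketched in base R alone, applied to the centroids from the question. This is only an illustration under stated assumptions: the `centers` matrix is copied from the question's `scluster$centers`, the later measurements for player A are invented for the example, and `assign_to_centroids` is a hypothetical helper name, not a function from any package.

```r
# Fixed centroids, copied from scluster$centers in the question
centers <- matrix(
  c(0.3938961, 0.276415126666667, 0.32532626,
    0.00763359, 0.03725189, 0.40550867,
    0.50766075, 0.155320013333333, 0.06267444),
  nrow = 3,
  dimnames = list(c("1", "2", "3"), c("Metric.1", "Metric.2", "Metric.3"))
)

# Later measurements: player A's values are made up for illustration,
# players B to E keep their original values from sampledata
newdata <- data.frame(
  Player      = c("A", "B", "C", "D", "E"),
  Metric.1    = c(0.28, 0.28062338, 0.32532626, 0.29239642, 0.25622558),
  Metric.2    = c(0.03, 0.01172354, 0.40550867, 0.04026846, 0.05976367),
  Metric.3    = c(0.16, 0.20345662, 0.06267444, 0.08661417, 0.17588925),
  old_cluster = c(1L, 2L, 3L, 2L, 2L)
)

# Hypothetical helper: index of the nearest fixed centroid for each row
assign_to_centroids <- function(points, centers) {
  points <- as.matrix(points)
  # squared Euclidean distance from every point to centroid k
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(points, 2, centers[k, ])^2)
  })
  d2 <- matrix(d2, nrow = nrow(points))  # keep matrix shape for a single row
  max.col(-d2, ties.method = "first")    # smallest distance wins
}

metric_cols <- c("Metric.1", "Metric.2", "Metric.3")
newdata$new_cluster <- assign_to_centroids(newdata[, metric_cols], centers)
newdata$moved <- newdata$new_cluster != newdata$old_cluster

print(newdata[, c("Player", "old_cluster", "new_cluster", "moved")])
```

Because the centroids are never re-estimated, the cluster labels keep the same meaning from one time step to the next, so the `moved` column can be read directly as "this player changed group". Squared distances are enough here, since taking the square root does not change which centroid is nearest.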