Optimizing trimmed k-means for clustering 2D data with many outliers? Better approaches?

Time: 2017-08-23 09:50:50

Tags: r ggplot2 cluster-analysis

I have the following kind of data/plot:

[image: plot of the data]

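Just by looking at the data points, it is nearly impossible to tell where the peaks should be. When I plot them with 2D density smoothing in ggplot, however, I get these very nice peaks, and I can visually count ~10 groups of points that I would like to find. The exact number of "valid groups" is of course up for discussion.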

Data here: https://pastebin.com/5wquw7UF

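# df: data frame with numeric columns x and y (see the data link above)
library(ggplot2)
library(ggthemes)   # provides theme_tufte()
library(colorRamps) # provides matlab.like()
library(tclust)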

ggplot(data = df, aes(x = x, y = y)) +
    stat_density2d(geom = "raster",
                   aes(fill = ..density..),
                   contour = FALSE) +
    geom_point(col = "white", alpha = 0.1) +
    scale_x_continuous(expand = c(0,0),
                       limits = c(0,1)) +
    scale_y_continuous(expand = c(0,0),
                       limits = c(0,1)) +
    theme_tufte(base_size = 11, base_family = "Helvetica") +
    theme(axis.text = element_text(color = "black"),
          panel.border = element_rect(colour = "black", fill=NA, size=0.7),
          legend.key.height = unit(2.5,"line"),
          legend.key.width = unit(1, "line")) +
    scale_fill_gradientn(name = "Density",
                         colours = matlab.like(1000))

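I have looked into trimmed clustering using the tclust package. By fiddling with the parameters I was able to come up with the following. However, no matter how much I play with them, I can't seem to get groups that are as "tight" as the ones I see visually. In particular, group 5 seems to sneak into places where it doesn't belong. Group 10 is also a bit odd, but it is isolated enough to be discarded afterwards.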

Are there better approaches, or am I just not understanding how to set the parameters properly?

set.seed(2)

trimmed_cluster <- tclust(
    x = df[, c("x", "y")], # cluster on the coordinates only, even if df gains extra columns on a re-run
    k = 10, # 9
    alpha = 0.1, # 0.1
    drop.empty.clust = FALSE,
    equal.weights = TRUE,
    restr = c("sigma", "eigen"), # sigma
    restr.fact = 1
)

df$cluster <- trimmed_cluster$cluster

trimmed_cluster_centers <- data.frame(t(trimmed_cluster$centers))

df_clustered <- subset(df, cluster != 0)

ggplot(data = df, aes(x = x, y = y)) +
    stat_density2d(geom = "raster",
                   aes(fill = ..density..),
                   contour = FALSE) +
    geom_point(data = df_clustered, aes(x = x, y = y, col = as.factor(cluster))) +
    geom_text(data = trimmed_cluster_centers,
              aes(x = x, y = y, label = seq_len(nrow(trimmed_cluster_centers))),
              size = 5,
              fontface = "bold",
              col = "yellow2") +
    scale_x_continuous(expand = c(0,0),
                       limits = c(0,1)) +
    scale_y_continuous(expand = c(0,0),
                       limits = c(0,1)) +
    theme_tufte(base_size = 11, base_family = "Helvetica") +
    theme(axis.text = element_text(color = "black"),
          panel.border = element_rect(colour = "black", fill=NA, size=0.7),
          legend.key.height = unit(0.8,"line"),
          legend.key.width = unit(0.5, "line")) +
    scale_fill_gradientn(name = "Density",
                         colours = matlab.like(1000)) +
    scale_color_brewer(name = "cluster ID",
                       type = "qual",
                       palette = "Spectral")

[image: tclust clustering result]

2 answers:

Answer 0 (score: 1)

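I suggest you use DBSCAN (density-based clustering) instead of k-means.

It is a well-tested and frequently used clustering algorithm for finding density-connected components of arbitrary shape.

The N in the name stands for noise: the algorithm can "ignore" points that do not belong to any cluster because the density around them is too low. It is very robust to noise, which may help you.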
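To make the noise handling concrete, here is a minimal base-R sketch of the DBSCAN idea. It is not the answerer's code: the function name `dbscan_sketch`, the toy data, and the values `eps = 0.04` and `min_pts = 5` are all assumptions chosen for illustration. For real work, the `dbscan` package on CRAN provides a tested implementation (`dbscan::dbscan(x, eps, minPts)`), and both parameters would need tuning on the actual data.

```r
# Illustrative DBSCAN sketch in base R (hypothetical helper, not a library API).
# Returns an integer label per row; 0 means "noise".
dbscan_sketch <- function(X, eps, min_pts) {
  X <- as.matrix(X)
  n <- nrow(X)
  D <- as.matrix(dist(X))
  # eps-neighbourhood of every point (each point counts as its own neighbour)
  neighbors <- lapply(seq_len(n), function(i) which(D[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts
  cl <- integer(n)          # 0 = noise / not yet assigned
  cluster_id <- 0L
  for (i in seq_len(n)) {
    if (cl[i] != 0L || !is_core[i]) next
    cluster_id <- cluster_id + 1L
    cl[i] <- cluster_id
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cl[j] != 0L) next
      cl[j] <- cluster_id   # core or border point joins the cluster
      # only core points keep expanding the cluster
      if (is_core[j]) queue <- c(queue, neighbors[[j]])
    }
  }
  cl
}

# Toy data (assumed): two tight blobs plus uniform background noise
set.seed(1)
toy <- rbind(
  cbind(rnorm(60, 0.25, 0.03), rnorm(60, 0.25, 0.03)),
  cbind(rnorm(60, 0.75, 0.03), rnorm(60, 0.75, 0.03)),
  cbind(runif(30), runif(30))
)
cl <- dbscan_sketch(toy, eps = 0.04, min_pts = 5)
table(cl)  # label 0 collects the points treated as noise
```

Points labelled 0 can simply be dropped before plotting, much like the `cluster != 0` filter used for the tclust result in the question.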

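Answer 1 (score: 0)

If you are looking for density peaks, the mean shift algorithm may help. As with any clustering algorithm, you will probably want to spend some time tuning the parameters, but I fairly quickly got something that seems reasonable.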

library(LPCM)
MS7 <- ms(df, 0.07)  # second argument is the bandwidth h
MS7$cluster.center
        [,1]       [,2]
1 0.55790817 0.46878846
2 0.42916901 0.60982702
3 0.04142821 0.63190748
4 0.58098385 0.03693459
5 0.01561478 0.19987934
6 0.18271326 0.01630580
7 0.80381893 0.65499869
8 0.59797721 0.88041362
9 0.86784436 0.95078057

[image: results of mean shift]
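For readers who want to see what mean shift does under the hood, the following base-R sketch may help. It is an illustration under stated assumptions, not LPCM's implementation: the function `mean_shift`, the Gaussian kernel, the single-linkage merge at `h / 2`, and the toy data are all choices made here for the example. The bandwidth `h` plays the same role as the `0.07` passed to `ms()` above.

```r
# Illustrative Gaussian-kernel mean shift in base R (hypothetical helper).
mean_shift <- function(X, h, iter = 200, tol = 1e-6) {
  X <- as.matrix(X)
  modes <- X
  for (i in seq_len(nrow(X))) {
    p <- X[i, ]
    for (it in seq_len(iter)) {
      # Gaussian weight of every data point relative to the current position
      d2 <- rowSums(sweep(X, 2, p)^2)
      w <- exp(-d2 / (2 * h^2))
      # move to the weighted mean of the data (the "mean shift" step)
      p_new <- colSums(X * w) / sum(w)
      if (sqrt(sum((p_new - p)^2)) < tol) break
      p <- p_new
    }
    modes[i, ] <- p
  }
  # trajectories that end close together are treated as the same density peak
  labels <- cutree(hclust(dist(modes), method = "single"), h = h / 2)
  centers <- rowsum(modes, labels) / as.vector(table(labels))
  list(centers = centers, labels = labels)
}

# Toy data (assumed): two tight blobs plus uniform background noise
set.seed(1)
toy <- rbind(
  cbind(rnorm(60, 0.25, 0.03), rnorm(60, 0.25, 0.03)),
  cbind(rnorm(60, 0.75, 0.03), rnorm(60, 0.75, 0.03)),
  cbind(runif(20), runif(20))
)
fit <- mean_shift(toy, h = 0.07)
table(fit$labels)  # large groups are density peaks; tiny groups are stray points
```

Unlike DBSCAN, mean shift assigns every point to some peak, so isolated outliers tend to show up as very small clusters that can be filtered out by size afterwards.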