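I have data that look like the plot below.
Looking at the raw points alone, it is nearly impossible to tell where the peaks are. With a 2D density smooth in ggplot2, however, I get very clear peaks, and I can visually count roughly 10 groups of points that I want to find. The exact number of "valid" groups is of course open to discussion.
The data are here: https://pastebin.com/5wquw7UF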
library(ggplot2)
library(ggthemes)   # provides theme_tufte()
library(colorRamps) # provides matlab.like()
library(tclust)

# df: data frame with columns x and y, loaded from the link above
ggplot(data = df, aes(x = x, y = y)) +
  # smoothed 2D density drawn as a raster
  stat_density2d(geom = "raster",
                 aes(fill = ..density..),
                 contour = FALSE) +
  # raw points overlaid faintly
  geom_point(col = "white", alpha = 0.1) +
  scale_x_continuous(expand = c(0, 0),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 1)) +
  theme_tufte(base_size = 11, base_family = "Helvetica") +
  theme(axis.text = element_text(color = "black"),
        panel.border = element_rect(colour = "black", fill = NA, size = 0.7),
        legend.key.height = unit(2.5, "line"),
        legend.key.width = unit(1, "line")) +
  scale_fill_gradientn(name = "Density",
                       colours = matlab.like(1000))
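I have looked at trimmed clustering with the tclust package. By fiddling with the data, I was able to come up with the following. However, no matter how much I adjust the parameters, I cannot get groups that are as "tight" as the ones I see visually. Cluster 5 in particular seems to creep into places where it does not belong. Cluster 10 is also a bit odd, but it is isolated enough to discard afterwards.
Is there a better method, or do I just not understand how to set the parameters properly?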
set.seed(2)
trimmed_cluster <- tclust(
  x = df,
  k = 10, # 9
  alpha = 0.1, # 0.1
  drop.empty.clust = FALSE,
  equal.weights = TRUE,
  restr = c("sigma", "eigen"), # sigma
  restr.fact = 1
)
df$cluster <- trimmed_cluster$cluster
trimmed_cluster_centers <- data.frame(t(trimmed_cluster$centers))
# cluster 0 holds the trimmed points
df_clustered <- subset(df, cluster != 0)
ggplot(data = df, aes(x = x, y = y)) +
  stat_density2d(geom = "raster",
                 aes(fill = ..density..),
                 contour = FALSE) +
  # only the points that were not trimmed, coloured by cluster
  geom_point(data = df_clustered, aes(x = x, y = y, col = as.factor(cluster))) +
  # label each cluster centre with its index
  geom_text(data = trimmed_cluster_centers,
            aes(x = x, y = y, label = as.character(seq_len(nrow(trimmed_cluster_centers)))),
            size = 5,
            fontface = "bold",
            col = "yellow2") +
  scale_x_continuous(expand = c(0, 0),
                     limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0, 1)) +
  theme_tufte(base_size = 11, base_family = "Helvetica") +
  theme(axis.text = element_text(color = "black"),
        panel.border = element_rect(colour = "black", fill = NA, size = 0.7),
        legend.key.height = unit(0.8, "line"),
        legend.key.width = unit(0.5, "line")) +
  scale_fill_gradientn(name = "Density",
                       colours = matlab.like(1000)) +
  scale_color_brewer(name = "cluster ID",
                     type = "qual",
                     palette = "Spectral")
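Answer 0 (score: 1)
I suggest you use DBSCAN, a density-based clustering algorithm, instead of k-means.
It is a well-tested and widely used clustering algorithm that finds density-connected components of arbitrary shape.
The N in the name stands for noise: the algorithm can "ignore" points that do not belong to any cluster because the density around them is too low. It is very robust to noise, which may help you here.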
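As a rough illustration of the idea, here is a minimal sketch using `dbscan()` from the dbscan package. It runs on synthetic stand-in data (two tight blobs plus uniform noise), not on the data from the question, and the values of `eps` and `minPts` are guesses that would need tuning for the real data.

```r
library(dbscan)

set.seed(1)
# Synthetic stand-in for the question's data (an assumption, not the real df):
# two tight blobs plus uniformly scattered noise in the unit square
blob1 <- cbind(rnorm(100, 0.25, 0.02), rnorm(100, 0.25, 0.02))
blob2 <- cbind(rnorm(100, 0.75, 0.02), rnorm(100, 0.75, 0.02))
noise <- cbind(runif(50), runif(50))
pts <- rbind(blob1, blob2, noise)
colnames(pts) <- c("x", "y")

# eps: neighbourhood radius; minPts: points needed within eps for a core point.
# Both values are illustrative guesses.
db <- dbscan(pts, eps = 0.03, minPts = 10)

# Cluster 0 collects the points DBSCAN treats as noise
table(db$cluster)
```

Noise points end up in cluster 0, so they can be dropped the same way as the trimmed points from tclust. The k-nearest-neighbour distance plot from `kNNdistplot()` in the same package is one common way to pick a starting value for `eps`.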
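Answer 1 (score: 0)
If you are looking for density peaks, the mean shift algorithm may help. As with any clustering algorithm, you will probably want to spend some time tuning the parameters, but I fairly quickly got something that looks reasonable.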
library(LPCM)
# mean shift clustering; the second argument is the bandwidth
MS7 = ms(df, 0.07)
MS7$cluster.center
        [,1]       [,2]
1 0.55790817 0.46878846
2 0.42916901 0.60982702
3 0.04142821 0.63190748
4 0.58098385 0.03693459
5 0.01561478 0.19987934
6 0.18271326 0.01630580
7 0.80381893 0.65499869
8 0.59797721 0.88041362
9 0.86784436 0.95078057
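For intuition about what mean shift does, here is a minimal base-R sketch with a Gaussian kernel. It is not how LPCM implements `ms()`: the function `mean_shift_point`, the synthetic data, and the merge threshold of 0.01 are all hypothetical choices made for illustration.

```r
# Follow one mean-shift trajectory: repeatedly move a point to the
# Gaussian-weighted mean of the data until it stops moving.
# (Hypothetical helper for illustration, not part of LPCM.)
mean_shift_point <- function(start, data, h, tol = 1e-6, max_iter = 500) {
  p <- start
  for (i in seq_len(max_iter)) {
    d2 <- rowSums(sweep(data, 2, p)^2)    # squared distances from p to all points
    w <- exp(-d2 / (2 * h^2))             # Gaussian kernel weights, bandwidth h
    p_new <- colSums(data * w) / sum(w)   # weighted mean = next position
    if (sqrt(sum((p_new - p)^2)) < tol) break
    p <- p_new
  }
  p_new
}

set.seed(1)
# Synthetic stand-in data (an assumption): two blobs in the unit square
pts <- rbind(cbind(rnorm(100, 0.25, 0.03), rnorm(100, 0.25, 0.03)),
             cbind(rnorm(100, 0.75, 0.03), rnorm(100, 0.75, 0.03)))

# Shift every point uphill to the density mode it converges to
modes <- t(apply(pts, 1, mean_shift_point, data = pts, h = 0.07))

# Points whose trajectories end at (nearly) the same mode share a cluster
labels <- cutree(hclust(dist(modes), method = "single"), h = 0.01)
table(labels)
```

The bandwidth `h` plays the same role as the 0.07 above: a smaller value splits the data into more, tighter peaks, while a larger value merges neighbouring peaks.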