Question

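My main and most important goal is to find the groups in which many points lie on the same line. My idea was to do this with the help of kmeans, but perhaps you have a better idea.

I will explain using the following two plots (shown below; each plot describes one group):

Plot 1, for Group 1:

We can see that many points lie at roughly the same y value, and I am trying to work out how to find the groups that have this kind of "point distribution".

Below is plot 2, for Group 2, which does not show such a "point distribution".

Here is the data corresponding to the two plots above: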

data <- structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1), 
    x = c(100L, 150L, 250L, 287L, 312L, 387L, 475L, 550L, 837L, 
    937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 1662L, 1700L, 
    1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 4762L, 5362L, 
    5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 7800L, 7937L, 
    7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 9312L, 9400L, 
    9600L, 550L, 612L, 1962L, 5412L, 8425L, 9375L, 5412L), y = c(493L, 
    482L, 479L, 476L, 481L, 479L, 474L, 480L, 480L, 491L, 489L, 
    490L, 485L, 485L, 485L, 479L, 482L, 482L, 482L, 482L, 484L, 
    489L, 491L, 489L, 496L, 498L, 500L, 0L, 498L, 500L, 502L, 
    506L, 497L, 0L, 495L, 506L, 497L, 494L, 498L, 500L, 496L, 
    499L, 496L, 495L, 495L, 498L, 442L, 447L, 394L, 465L, 806L, 
    700L, 502L)), row.names = c(23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 51L, 52L, 53L, 54L, 55L, 
56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 
69L, 574L, 575L, 576L, 577L, 578L, 579L, 815L), class = "data.frame")

A short preview:

Group   x     y
1       100   493
1       150   482
1       250   479
1       287   476
1       312   481
1       387   479

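Here we have each group (1 & 2) together with its x and y coordinates.

My approach so far:

I used this code to round the y values to the nearest multiple of 20: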

    round_any <- function(x, accuracy, f = round) { f(x / accuracy) * accuracy } # round x to the nearest multiple of accuracy
    data$y_rd <- round_any(data$y, 20)

I did this because the points usually do not lie on exactly the same y line.
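For example, a minimal sketch of what the rounding does to three of the y values from the data above (the helper is the same round_any as before; the variable names y and y_rd_example are only for illustration):

```r
# Same helper as above: round x to the nearest multiple of `accuracy`
round_any <- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

y <- c(493, 482, 476)               # three y values taken from the data
y_rd_example <- round_any(y, 20)    # 500 480 480
print(y_rd_example)
```

So 482 and 476 end up on the same rounded line (480), while 493 moves to 500. Note that R's round() sends exact halves to the even integer, so a y of 490 (490 / 20 = 24.5) becomes 480 rather than 500.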

In addition, I used this code to cluster the x coordinates within each group and each y_rd (rounded y coordinate):

    data$id <- paste(data$Group, data$y_rd, sep = "_") # create id that contains Group and y_rd values
    res2 <- tapply(data$x, INDEX = data$id, function(x) kmeans(x,2)) # kmeans with fixed number of clusters    
    res3 <- lapply(names(res2), function(x) data.frame(y=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))     
    res3 <- do.call(rbind, res3)

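But I cannot use it the way I need to, because I cannot specify one fixed number of clusters for every group and y_rd...

At this point I am stuck and do not know how to find the groups that have this kind of distribution...

The result I would like to get: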

Group Cluster MaxPoints
1      1         3
1      2         20
1      3         7

I am open to any ideas or tips that would help me find the groups showing this kind of pattern. Thanks!

Answer 1

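Some points of your question are not clear to me, so here is an answer that may serve as a starting point.

Since the most important variable seems to be y, you could first examine it within each group and then apply k-means to the "winning" groups.

First, you can detect the groups that may have a "line" distribution by looking at some boxplots or histograms: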

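library(dplyr); library(ggplot2)  # dats is assumed to be the question's data frame with the y_rd column added
dats %>% ggplot(aes(y_rd)) + geom_histogram() + facet_wrap(vars(Group)) + theme_light()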

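It now appears that there is one group (1) with a long line plus one smaller cluster, and one group (2) with many small clusters. So in this case you can divide the data into groups that have a long line (here group 1, with two clusters in total) and groups that have many "small clusters" and no long line (group 2). The idea is to sort your 100 groups into categories such as "no long line", "long line and 1 small cluster", "long line and 2 small clusters", and so on. With these categories you can split the dataset and run the clustering. In this case we discard the second group and use k-means with 2 centers on the first group, because it appears to have one long line and one other small cluster.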

vec <- c(1)  # vector of groups that seem to have a long line

# A loop to cluster them. The number of clusters is fixed at two here; after
# looking at the histograms you can run one such loop per set of similar distributions.
listed <- list()
for (i in vec) {
  grp <- dats[dats$Group == i, ]       # rows of group i (the original hard-coded group 1)
  clustering <- kmeans(grp$y_rd, 2)    # column 4 in the original, assumed to be y_rd
  listed[[i]] <- data.frame(grp, cl = clustering$cluster)  # keep x and y for plotting
}

Now you can plot it:

library(ggplot2)
ggplot(listed[[1]], aes(x,y, color = as.factor(cl))) + geom_point() + theme_light()
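If there are too many groups to inspect by eye, the "long line" check could perhaps be approximated by counting. The following is a minimal sketch under stated assumptions, not a definitive method: the data frame toy is a small hypothetical subset in the spirit of the question's data, and the threshold min_points is an arbitrary illustrative choice. For each group it finds the rounded y level that holds the most points and flags the group when that count reaches the threshold.

```r
round_any <- function(x, accuracy, f = round) f(x / accuracy) * accuracy

# Hypothetical toy data: group 1 has a line near y = 480, group 2 is scattered
toy <- data.frame(
  Group = c(1, 1, 1, 1, 1, 2, 2, 2),
  y     = c(482, 479, 476, 481, 0, 442, 394, 806)
)
toy$y_rd <- round_any(toy$y, 20)

min_points <- 3  # assumed threshold for calling a bin a "long line"

# Count the points in each (Group, y_rd) bin
bins <- aggregate(n ~ Group + y_rd, data = transform(toy, n = 1), FUN = sum)

# Keep the size of the largest bin per group and flag groups above the threshold
max_pts <- aggregate(n ~ Group, data = bins, FUN = max)
names(max_pts)[2] <- "MaxPoints"
max_pts$long_line <- max_pts$MaxPoints >= min_points
print(max_pts)
```

On the real data, the threshold would need tuning, and the groups flagged this way could then be passed to the k-means loop above in place of the hand-picked vec.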
