Question

I have a data frame with two columns, call them X and Y. Here is an example:

df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)

which produces this:

  X  Y
1 A 14
2 B 12
3 C 12
4 D 11
5 E  9
6 F  6
7 G  4
8 H  1

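Note that the data frame is always sorted by Y in descending order. I want to group the cases whose Y values lie within a certain range of each other, and update the X column to reflect the grouping. For example, if the range is 2, I want the final output to be: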

     X    new_Y
     A 14.00000
 B C D 11.66667
     E  9.00000
   F G  5.00000
     H  1.00000

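Let me explain how I got there. Starting from the original df data frame, the closest values are those of B and C. Joining them results in:

   X new_Y
   A    14
 B C    12
   D    11
   E     9
   F     6
   G     4
   H     1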

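The new_Y value for the case B C is the mean of the original Y values of B and C, i.e. 12. In this second data frame, D is no more than 2 away from B C, so they are the next to be grouped together:

     X    new_Y
     A 14.00000
 B C D 11.66667
     E  9.00000
     F  6.00000
     G  4.00000
     H  1.00000

Note that the new_Y value for B C D is 11.67, because the original Y values of B, C and D are 12, 12 and 11 respectively, and their mean is 11.667. I do not want the code to return the mean of the previous iteration's value and D (which would be 11.5 in this case).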

Finally, F and G can also be grouped together, which yields the final output shown above.

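I am not sure what code is needed to achieve this. My only idea is to compute the distance from each element to the previous and the next one, find the minimum, and check whether it exceeds the threshold (2 in the example above). Depending on where the minimum occurs, join the entries in the X column while averaging the Y values from the original table. Repeat this until the minimum is greater than the threshold.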

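But I am not sure how to write the code for this, or whether there is a more efficient approach than the algorithm suggested above. Any help would be greatly appreciated.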

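P.S. I forgot to mention that if the distances to the previous and to the next Y value are the same, the larger Y values should be grouped together. So

would be returned as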

Thank you in advance for your patience. I apologize if I have not explained this well.

Answer 1

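This sounds like hierarchical agglomerative clustering.

To get the groups, use dist, hclust and cutree.

Note that centroid clustering with hclust expects the dissimilarities to be squared Euclidean distances, which is why both the distances and the cutoff are squared below.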

df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)

dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
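
As a quick check of the squared-distance point, the merge heights stored in `hc$height` can be inspected. This is only a small sketch on the same example data (the variable name `merge_dist` is introduced here for illustration): taking the square root puts the heights back on the scale of Y, which shows that cutting at `dCutoff^2` keeps exactly the merges whose centroids are at most 2 apart.

```r
# Same example values as above
y <- c(14, 12, 12, 11, 9, 6, 4, 1)

# Centroid linkage on squared Euclidean distances
hc <- hclust(dist(y)^2, method = "centroid")

# Heights are on the squared scale; sqrt() converts them back to the scale of Y
merge_dist <- sqrt(hc$height)
round(merge_dist[1:3], 3)
#> [1] 0 1 2

# Cutting at 2^2 keeps every merge whose centroid distance is at most 2
group_id <- cutree(hc, h = 2^2)
```

One caveat: centroid linkage can produce non-monotone merge heights on other data, in which case cutree may refuse to cut the tree, so it is worth checking `hc$height` before relying on this.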

To summarise the original table by group, we can use dplyr.

library('dplyr')
df %>%
  group_by(group_id = group_id) %>%
  summarise(
    X = paste(X, collapse = ' '),
    Y = mean(Y))
#> # A tibble: 5 x 3
#>   group_id     X        Y
#>      <int> <chr>    <dbl>
#> 1        1     A 14.00000
#> 2        2 B C D 11.66667
#> 3        3     E  9.00000
#> 4        4   F G  5.00000
#> 5        5     H  1.00000
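
If dplyr is not available, the same summary can be sketched in base R. The function name `group_within` and its argument `cutoff` are made up for this illustration; the function simply wraps the steps above and assumes a data frame with a character column X and a numeric column Y.

```r
# Hypothetical helper: cluster Y with centroid linkage, cut the tree at
# `cutoff`, then collapse X and average the original Y values per group.
group_within <- function(df, cutoff) {
  # hclust() needs at least two observations
  if (nrow(df) < 2) {
    return(data.frame(X = df$X, new_Y = df$Y, stringsAsFactors = FALSE))
  }
  hc <- hclust(dist(df$Y)^2, method = "centroid")
  group_id <- cutree(hc, h = cutoff^2)
  data.frame(
    X     = as.vector(tapply(df$X, group_id, paste, collapse = " ")),
    new_Y = as.vector(tapply(df$Y, group_id, mean)),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
res <- group_within(df, cutoff = 2)
# res$X holds the groups "A", "B C D", "E", "F G", "H"
```

Because cutree numbers the groups in order of first appearance, the rows come out in the same descending order as the input when the input is already sorted by Y.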

Answer 2

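Note that this averages the means from the previous iteration rather than the original values (so B C D gets 11.5 instead of 11.667). I hope it helps anyway.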

library(data.table)
df <- data.table(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)



differences <- c(diff(df$Y),NA) # NA for the last element
df$difference <- abs(differences) # absolute gaps between consecutive elements (works because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum

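while (minimum <= 2){ # <= so that a gap exactly equal to the threshold (F and G) is merged too
  index <- which(df$difference==minimum)[1] # first position of the minimum; on ties this is the pair with the larger Y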
  check = FALSE
  # because the last row cannot have a number since there is not an element after that
  # we need to see if this element has the minimum difference with its previous
  # if it does not have the minimum difference then we exclude it and paste it later
  if(df[nrow(df)-1,difference]!=minimum){
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check = TRUE
  }
  tmp  <- df[(index:(index+1))]
  df <- df[-(index:(index+1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[,.(X,Y)],to_bind)
  if(check){
    df <- rbind(df,last_row[,.(X,Y)])
  }
  setorder(df,-Y)

  differences <- c(diff(df$Y),NA) # NA for the last element
  df$difference <- abs(differences) # absolute gaps between consecutive elements (works because Y is sorted)
  minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
}
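
For comparison, here is a minimal base R sketch of the iterative procedure described in the question, which averages the original Y values of every member of a group instead of the previous iteration's means. The function name `merge_adjacent` and its arguments are hypothetical. The sketch assumes Y is already sorted in decreasing order, and it treats a gap exactly equal to the threshold as mergeable (as with F and G in the question).

```r
# Hypothetical helper: repeatedly merge the two neighbouring groups whose
# means are closest, as long as that gap does not exceed `threshold`.
# Assumes `y` is sorted in decreasing order.
merge_adjacent <- function(x, y, threshold) {
  groups <- as.list(seq_along(y))             # each row starts as its own group
  repeat {
    if (length(groups) < 2) break
    means <- vapply(groups, function(g) mean(y[g]), numeric(1))
    gaps  <- abs(diff(means))                 # gaps between neighbouring group means
    i     <- which.min(gaps)                  # first minimum: ties go to the larger Y
    if (gaps[i] > threshold) break
    groups[[i]] <- c(groups[[i]], groups[[i + 1]])  # merge group i + 1 into group i
    groups[[i + 1]] <- NULL
  }
  data.frame(
    X     = vapply(groups, function(g) paste(x[g], collapse = " "), character(1)),
    new_Y = vapply(groups, function(g) mean(y[g]), numeric(1)),  # mean of original values
    stringsAsFactors = FALSE
  )
}

df <- data.frame(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
res <- merge_adjacent(df$X, df$Y, threshold = 2)
# res$X holds the groups "A", "B C D", "E", "F G", "H"
```

Each group is stored as a vector of row indices into the original data, so the mean is always recomputed from the original Y values, which gives 11.667 for B C D rather than 11.5.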

R: Grouping data within a certain range
