Group by and summarize by frequency

Time: 2020-06-30 06:25:55

Tags: r dplyr

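I want to group my dataset by a set of ID columns and, within each group, keep only the row with the largest value in the count column. Here is a sample of my dataset.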

   BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
1   150     0     0     0     0     0     0     0     0   152  1674      0      0      0      0     1
2   150     0     0     0     0     0     0     0     0   152  1676      0      0      0      0     2
3   150     0     0     0     0     0     0     0     0   152  1678      0      0      0      0     2
4   150     0     0     0     0     0     0     0     0   152  1680      0      0      0      0    13
5   150     0     0     0     0     0     0     0     0   152  1682      0      0      0      0     3
6   150     0     0     0     0     0     0     0     0   152  1684      0      0      0      0     4

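I want to group the data by the first 10 ID columns and collapse each group into a single row: BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5 ASTN
For the remaining columns, TRNID TRNID2 TRNID3 TRNID4 TRNID5, I want to keep the values from the row that has the maximum value in the count column.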

The final output I want looks like this.

BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
 150   0     0     0     0     0     0     0     0    152  1680     0      0      0      0    13

How can I summarize my data this way? I have 2,931,959 rows containing many more BSTN, ASTN groups.

dput(head(A_Routetable2))
structure(list(BSTN = c(150, 150, 150, 150, 150, 150), ASTN1 = c(0, 
0, 0, 0, 0, 0), BSTN2 = c(0, 0, 0, 0, 0, 0), ASTN2 = c(0, 0, 
0, 0, 0, 0), BSTN3 = c(0, 0, 0, 0, 0, 0), ASTN3 = c(0, 0, 0, 
0, 0, 0), BSTN4 = c(0, 0, 0, 0, 0, 0), ASTN4 = c(0, 0, 0, 0, 
0, 0), BSTN5 = c(0, 0, 0, 0, 0, 0), ASTN = c(152, 152, 152, 152, 
152, 152), TRNID = c(1674, 1676, 1678, 1680, 1682, 1684), TRNID2 = c(0, 
0, 0, 0, 0, 0), TRNID3 = c(0, 0, 0, 0, 0, 0), TRNID4 = c(0, 0, 
0, 0, 0, 0), TRNID5 = c(0, 0, 0, 0, 0, 0), count = c(1L, 2L, 
2L, 13L, 3L, 4L)), row.names = c(NA, -6L), groups = structure(list(
    BSTN = c(150, 150, 150, 150, 150, 150), ASTN1 = c(0, 0, 0, 
    0, 0, 0), BSTN2 = c(0, 0, 0, 0, 0, 0), ASTN2 = c(0, 0, 0, 
    0, 0, 0), BSTN3 = c(0, 0, 0, 0, 0, 0), ASTN3 = c(0, 0, 0, 
    0, 0, 0), BSTN4 = c(0, 0, 0, 0, 0, 0), ASTN4 = c(0, 0, 0, 
    0, 0, 0), BSTN5 = c(0, 0, 0, 0, 0, 0), ASTN = c(152, 152, 
    152, 152, 152, 152), TRNID = c(1674, 1676, 1678, 1680, 1682, 
    1684), TRNID2 = c(0, 0, 0, 0, 0, 0), TRNID3 = c(0, 0, 0, 
    0, 0, 0), TRNID4 = c(0, 0, 0, 0, 0, 0), .rows = structure(list(
        1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, 6L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

1 answer:

Answer 0 (score: 2)

You can group_by column position and then, within each group, select the row with the maximum value of count.

library(dplyr)
df %>% group_by(across(1:10)) %>% slice(which.max(count))

#   BSTN ASTN1 BSTN2 ASTN2 BSTN3 ASTN3 BSTN4 ASTN4 BSTN5  ASTN TRNID TRNID2 TRNID3 TRNID4 TRNID5 count
#  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <int>  <int>  <int>  <int> <int>
#1   150     0     0     0     0     0     0     0     0   152  1680      0      0      0      0    13

Or group by a range of column names:

df %>% group_by(across(BSTN:ASTN)) %>% slice(which.max(count))
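Both variants rely on which.max, which returns the position of the first maximum, so a tie within a group is resolved by row order. As a minimal sketch of an alternative, assuming dplyr 1.0.0 or later, slice_max() with with_ties = FALSE also returns a single row per group. The toy data frame below is hypothetical and keeps only a few of the original columns:

```r
library(dplyr)

# Hypothetical toy data: the first group has a tie (count 13 appears twice).
toy <- data.frame(
  BSTN  = c(150, 150, 150, 151),
  ASTN  = c(152, 152, 152, 153),
  TRNID = c(1676, 1680, 1682, 2001),
  count = c(2L, 13L, 13L, 5L)
)

top_rows <- toy %>%
  group_by(BSTN, ASTN) %>%
  # Keep one row per group, the one with the largest count;
  # with_ties = FALSE prevents tied rows from all being returned.
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_rows)
```

With the default with_ties = TRUE, every row sharing the maximum count would be kept, which may or may not be what you want.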

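The dput shared by the OP is already grouped, which causes across to throw an error. We can ungroup the data first and then run the code above without any error. Alternatively, the functions from previous versions of dplyr work on the grouped data as-is, for example group_by_at (now superseded by across):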

A_Routetable2 %>% group_by_at(1:10) %>% slice(which.max(count))
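The ungroup-first route can be sketched as follows. This assumes dplyr 1.0.0 or later for across(); the data frame is a hypothetical cut-down version of the OP's table, pre-grouped to mimic the shared dput:

```r
library(dplyr)

# Hypothetical cut-down data with two station pairs, grouped in advance
# to mimic the grouped_df in the OP's dput.
routes <- data.frame(
  BSTN  = c(150, 150, 150, 151, 151),
  ASTN  = c(152, 152, 152, 153, 153),
  TRNID = c(1674, 1680, 1682, 2001, 2003),
  count = c(1L, 13L, 3L, 5L, 7L)
) %>%
  group_by(BSTN, ASTN, TRNID)

result <- routes %>%
  ungroup() %>%                      # drop the existing grouping first
  group_by(across(BSTN:ASTN)) %>%    # regroup by the ID columns only
  slice(which.max(count)) %>%        # keep the row with the largest count per group
  ungroup()

print(result)
```

On the real data the same pipeline would start from A_Routetable2 instead of routes, with the column range covering all ten ID columns.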