Summarise some columns while keeping the other columns unchanged

Time: 2019-02-18 20:28:53

Tags: r datatable dplyr

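I have a data frame like the dummy sample below; my real dataset has 56 variables. I want to drop date, aggregate by id, and sum the last 4 variables while keeping the other variables unchanged.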

df <- data.frame(stringsAsFactors=FALSE,
          date = c("2019-02-10", "2019-02-10", "2019-02-11", "2019-02-11",
                   "2019-02-12", "2019-02-12", "2019-02-13", "2019-02-13",
                   "2019-02-14", "2019-02-14"),
            id = c("18100410-aa", "18101080-ae", "18100410-aa", "18101080-ae",
                   "18100410-aa", "18101080-ae", "18100410-aa", "18101080-ae",
                   "18100410-aa", "18101080-ae"),
        f_type = c(4L, 2L, 4L, 2L, 4L, 2L, 4L, 2L, 4L, 2L),
           reg = c(6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L, 6L, 7L),
        hh_p10 = c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L),
      internet = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
      youngest = c(5L, 7L, 5L, 7L, 5L, 7L, 5L, 7L, 5L, 7L),
       a_group = c(3L, 6L, 3L, 6L, 3L, 6L, 3L, 6L, 3L, 6L),
     total_prd = c(130L, 337L, 374L, 261L, 106L, 230L, 150L, 36L, 15L, 123L),
   B_totalprod = c(20L, 0L, 256L, 0L, 32L, 0L, 0L, 36L, 0L, 45L),
   p_totalprod = c(0L, 81L, 11L, 260L, 26L, 230L, 0L, 0L, 15L, 0L),
   n_totalprod = c(110L, 256L, 107L, 1L, 48L, 0L, 150L, 0L, 0L, 78L)
)

I found the following solution, which uses the plyr package. It works, but I have to list all 52 unaffected variables by hand. Is there another way to accomplish this?

library(plyr)
# Group by every column that should stay unchanged, then sum the four totals.
ddply(df, .(id, f_type, reg, internet, hh_p10, youngest, a_group), summarise,
      total_prd   = sum(total_prd),
      B_totalprod = sum(B_totalprod),
      p_totalprod = sum(p_totalprod),
      n_totalprod = sum(n_totalprod))

1 answer:

Answer 0: (score: 2)

If the columns to be summed in your real dataset also contain "total" in their names, this approach should work:

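library(tidyverse)
df %>%
  select(-date) %>%   # drop the date column
  # group by every column whose name does not contain "total"
  group_by(.dots = str_subset(names(.), "total", negate = TRUE)) %>%
  summarise_all(list(sum = sum))   # sum the rest; results get a "_sum" suffix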

# A tibble: 2 x 11
# Groups:   id, f_type, reg, hh_p10, internet, youngest [2]
  id          f_type   reg hh_p10 internet youngest a_group total_prd_sum B_totalprod_sum p_totalprod_sum n_totalprod_sum
  <chr>        <int> <int>  <int>    <int>    <int>   <int>         <int>           <int>           <int>           <int>
1 18100410-aa      4     6      2        1        5       3           775             308              52             415
2 18101080-ae      2     7      1        2        7       6           987              81             571             335

group_by(.dots = str_subset(names(.), "total", negate = TRUE)) means that we group by every column in this dataset whose name does not contain the word "total".
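
With dplyr 1.0.0 or later, the same idea can probably be written with across() instead of .dots and summarise_all(). The following is only a sketch under a few assumptions: the data frame is a cut-down stand-in for the question's df (first four rows, fewer columns), and it relies on contains("total") matching exactly the columns to be summed, just as the answer above does.

```r
library(dplyr)

# Reduced stand-in for the question's data: first four rows, a subset of columns.
df <- data.frame(
  date        = c("2019-02-10", "2019-02-10", "2019-02-11", "2019-02-11"),
  id          = c("18100410-aa", "18101080-ae", "18100410-aa", "18101080-ae"),
  f_type      = c(4L, 2L, 4L, 2L),
  reg         = c(6L, 7L, 6L, 7L),
  total_prd   = c(130L, 337L, 374L, 261L),
  B_totalprod = c(20L, 0L, 256L, 0L),
  stringsAsFactors = FALSE
)

res <- df %>%
  select(-date) %>%
  # Group by every column whose name does not contain "total".
  group_by(across(!contains("total"))) %>%
  # across() skips the grouping columns, so only the "total" columns are summed.
  summarise(across(everything(), sum), .groups = "drop")

print(res)
```

Unlike the summarise_all(list(sum = sum)) call, this keeps the original column names. If the "_sum" suffix is wanted, across() accepts .names = "{.col}_sum".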