dplyr: summarise multiple groups in long format

Date: 2019-07-10 09:56:05

Tags: r dplyr data.table

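I know there are many questions that may sound similar in one way or another, but I have not been able to find an answer to this exact problem.

Let's say we have a toy dataset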

library(tidyverse)
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", 
             "f", "f", "f", "m", "f"),
  IQ = rnorm(10, 100, 15),
  Other = runif(10),
  Test = rnorm(10),
  group2 = c("A", "A", "A", "A", "A",
             "B", "B", "B", "B", "B")
)

from which we want to compute the mean, min, and max by Gender and group2.

For a single group, I can easily write

df %>% 
  group_by(Gender) %>% 
  select_if(is.numeric) %>% 
  gather(Variable, Value, -Gender) %>% 
  group_by(Variable, Gender) %>% 
  summarise(mean = mean(Value), 
            min = min(Value), 
            max = max(Value)) %>% 
  ungroup()

to get

  Variable Gender    mean     min     max
  <chr>    <chr>    <dbl>   <dbl>   <dbl>
1 IQ       f      99.2    81.9    121.
2 IQ       m      89.0    62.5    106.
3 Other    f       0.301   0.187    0.479
4 Other    m       0.395   0.0483   0.757
5 Test     f      -0.0770 -1.18     0.545
6 Test     m       0.163  -0.632    0.828

But I don't know how to do the same for multiple groups. I know I can use summarise_*() like this

df %>% 
  group_by(Gender) %>% 
  summarise_if(is.numeric, list(mean = mean, min = min, max = max))

but it returns wide format (one column per variable and statistic), which is almost useless once you have more than 10 variables.

What am I missing here?

3 answers:

Answer 0 (score: 2)

You can do this by adding gather, separate, and spread to your own code:

df %>% 
    group_by(Gender, group2) %>% 
    summarise_if(is.numeric, list(mean = mean, 
                                  min = min, 
                                  max = max)) %>% 
    gather(vars, vals, -Gender, -group2) %>% 
    separate(vars, c("Variable", "stat")) %>% 
    spread(stat, vals)

#### OUTPUT ####

# A tibble: 12 x 6
# Groups:   Gender [2]
   Gender group2 Variable     max    mean       min
   <chr>  <chr>  <chr>      <dbl>   <dbl>     <dbl>
 1 f      A      IQ       110.    103.     95.0    
 2 f      A      Other      0.934   0.469   0.00439
 3 f      A      Test       1.39    0.472  -0.446  
 4 f      B      IQ       121.     92.0    75.6    
 5 f      B      Other      0.730   0.461   0.261  
 6 f      B      Test       0.589   0.276  -0.524  
 7 m      A      IQ       112.    104.     94.3    
 8 m      A      Other      0.827   0.613   0.308  
 9 m      A      Test       0.724   0.136  -0.264  
10 m      B      IQ       115.    115.    115.     
11 m      B      Other      0.970   0.970   0.970  
12 m      B      Test      -1.05   -1.05   -1.05   
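
With releases of dplyr and tidyr newer than this 2019 thread assumes, the same result can probably be reached without the gather/separate/spread round trip. The following is a minimal sketch, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0; the seed and the name `res` are illustrative and not part of the answer above. It relies on across() naming its outputs as column_function (for example IQ_mean), so it would need adjusting if a variable name itself contained an underscore.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # illustrative seed so the toy data are reproducible
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

res <- df %>%
  group_by(Gender, group2) %>%
  # across() produces wide columns such as IQ_mean, IQ_min, IQ_max
  summarise(
    across(where(is.numeric), list(mean = mean, min = min, max = max)),
    .groups = "drop"
  ) %>%
  # the part before "_" goes to Variable; the part after names the value columns
  pivot_longer(
    -c(Gender, group2),
    names_to  = c("Variable", ".value"),
    names_sep = "_"
  )

res
```

Because the data are random, the numbers will differ from the output shown above, but the shape (one row per Gender, group2, and Variable) should be the same.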

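Answer 1 (score: 1)

You can first convert df to long format by gathering IQ, Other, and Test into a single variable column, and then compute the summary statistics for each group (including group2 and the new variable column).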
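
A minimal sketch of this long-first approach, assuming the toy df from the question (rebuilt here with an illustrative seed) and the dplyr/tidyr functions that were current in 2019. It follows the same steps as the tidyverse2 expression in the benchmark further down; the name `long_stats` is a placeholder.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # illustrative seed so the toy data are reproducible
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_stats <- df %>%
  # stack IQ, Other and Test into key/value pairs, keeping both grouping columns
  gather(key = "variable", value = "value", -c(Gender, group2)) %>%
  # one group per combination of Gender, group2 and original column name
  group_by(Gender, group2, variable) %>%
  summarise(mean = mean(value), min = min(value), max = max(value)) %>%
  ungroup()

long_stats
```

Reshaping first means the summary step only ever sees a single value column, so adding more numeric variables or more grouping columns does not change the summarise() call.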

Answer 2 (score: 0)

Here is a data.table approach

library( data.table )
melt( setDT(df), 
  id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE), 
                                        min = min(value, na.rm = TRUE),
                                        mean = mean(value, na.rm = TRUE)),
                                    by = .(Gender, group2, variable )][]

#     Gender group2 variable           max          min         mean
#  1:      m      A       IQ 120.739562935  83.46037366  96.99412720
#  2:      f      A       IQ  98.657598754  98.43677811  98.54718843
#  3:      f      B       IQ 111.973534436  71.38605822  94.04719457
#  4:      m      B       IQ 102.913093964 102.91309396 102.91309396
#  5:      m      A    Other   0.861929066   0.51651983   0.66098944
#  6:      f      A    Other   0.752484881   0.07648229   0.41448359
#  7:      f      B    Other   0.463524836   0.18308752   0.33301693
#  8:      m      B    Other   0.099740011   0.09974001   0.09974001
#  9:      m      A     Test   1.159379020  -0.83569116   0.04268551
# 10:      f      A     Test  -0.009017293  -0.77245300  -0.39073515
# 11:      f      B     Test   1.591132150  -0.99248570  -0.24997246
# 12:      m      B     Test   1.654489766   1.65448977   1.65448977
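
One detail worth noting in the call above: setDT() converts its argument to a data.table by reference, so df itself is changed, which is presumably why the benchmark copies the data first. A minimal sketch that leaves the original object untouched, assuming a plain data.frame version of the toy data; the seed and the names `DT` and `res` are illustrative and not from the answer:

```r
library(data.table)

set.seed(1)  # illustrative seed so the toy data are reproducible
df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5),
  stringsAsFactors = FALSE
)

# as.data.table() returns a converted copy, so df stays a plain data.frame
DT <- as.data.table(df)

# melt to long format, then summarise each Gender / group2 / variable group
res <- melt(DT, id.vars = c("Gender", "group2"))[
  , .(mean = mean(value), min = min(value), max = max(value)),
  by = .(Gender, group2, variable)
]

res
```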

Benchmark

# Unit: milliseconds
#       expr       min        lq      mean    median        uq       max neval
# data.table  1.498788  1.819936  1.997320  1.980358  2.218809  2.413124    10
# tidyverse1 11.263956 11.887270 12.421442 11.963340 12.484075 15.401816    10
# tidyverse2  4.952477  5.185053  6.303103  6.001478  6.902558  9.663341    10

microbenchmark::microbenchmark(
  data.table = {
    DT <- copy(df)
    melt( setDT(DT), 
          id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE), 
                                                min = min(value, na.rm = TRUE),
                                                mean = mean(value, na.rm = TRUE)),
                                            by = .(Gender, group2, variable )][]

  },
  tidyverse1 = {
    DT <- copy(df)
    df %>% 
      group_by(Gender, group2) %>% 
      summarise_if(is.numeric, list(mean = mean, 
                                    min = min, 
                                    max = max)) %>% 
      gather(vars, vals, -Gender, -group2) %>% 
      separate(vars, c("Variable", "stat")) %>% 
      spread(stat, vals)
  },
  tidyverse2 = {
    df %>%
      gather(key = "variable", value = "value", -c(Gender, group2)) %>%
      group_by(Gender, group2, variable) %>%
      summarize_at("value", list(mean = mean, min = min, max = max)) %>%
      ungroup()
  },
  times = 10 
)