How do I apply multiple functions to multiple columns with by?

Time: 2019-07-01 19:10:41

Tags: r data.table

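I am trying to apply multiple functions to multiple columns by a grouping variable. I can get the results, but not in a useful format. Below, I would like res2 to have the same layout as res1, extended with the by variable "cyl" and with one row per unique value of cyl.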

I tried omitting unlist and redefining my.sum.fun to return a numeric vector instead of a list, but I could not get the format I want.

library(data.table)

## The well known data 
data(mtcars)
DT <- data.table(mtcars)

## a custom set of summary functions
my.sum.fun = function(x){list(
    mean   = mean(x, na.rm=TRUE),
    median = median(x, na.rm=TRUE),
    sd     = sd(x, na.rm=TRUE)
    )}

## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
 mpg.mean mpg.median     mpg.sd    hp.mean  hp.median      hp.sd 
 20.090625  19.200000   6.026948 146.687500 123.000000  68.562868 

## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
    cyl         V1
 1:   6  19.742857
 2:   6  19.700000
 3:   6   1.453567
 4:   6 122.285714
 5:   6 110.000000
 6:   6  24.260491
 7:   4  26.663636
 8:   4  26.000000
 9:   4   4.509828
10:   4  82.636364
11:   4  91.000000
12:   4  20.934530
13:   8  15.100000
14:   8  15.200000
15:   8   2.560048
16:   8 209.214286
17:   8 192.500000
18:   8  50.976886

3 answers:

Answer 0: (score: 3)

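unlist has an option that avoids recursive unlisting: the recursive argument (recursive = TRUE by default). With recursive = FALSE only the outer level is flattened, so j returns a list of six length-one elements, and data.table turns each list element into its own column.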

DT[,unlist(lapply(.SD,my.sum.fun), 
      recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
#   cyl mpg.mean mpg.median   mpg.sd   hp.mean hp.median    hp.sd
#1:   6 19.74286       19.7 1.453567 122.28571     110.0 24.26049
#2:   4 26.66364       26.0 4.509828  82.63636      91.0 20.93453
#3:   8 15.10000       15.2 2.560048 209.21429     192.5 50.97689
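To see why the recursive argument matters, here is a minimal base R sketch of what j evaluates for a single group. The subset on cyl == 6 and the names per_col, flat_vec and flat_list are only for illustration and are not part of the answer above.

```r
# Same summary function as in the question
my.sum.fun <- function(x) list(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE)
)

# Roughly what lapply(.SD, my.sum.fun) produces for one group:
# a list with one 3-element list per column
per_col <- lapply(mtcars[mtcars$cyl == 6, c("mpg", "hp")], my.sum.fun)

flat_vec  <- unlist(per_col)                     # atomic vector of length 6
flat_list <- unlist(per_col, recursive = FALSE)  # list of 6 length-1 elements

is.list(flat_vec)   # FALSE: an atomic vector is stacked into one column
is.list(flat_list)  # TRUE: each list element becomes a separate column
names(flat_list)    # "mpg.mean" "mpg.median" "mpg.sd" "hp.mean" ...
```

Both results carry the same six names. The only difference is the type, and that type is what decides whether the output is long (res2 in the question) or wide.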

Answer 1: (score: 1)

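I realize that using dplyr inside data.table may seem a bit silly, but I don't think summarize_all will be slower than lapply, and this still lets you take advantage of data.table's fast grouping and so on.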

library(dplyr)

my_funs <- list(
    mean   = function(x) mean(x, na.rm=TRUE),
    median = function(x) median(x, na.rm=TRUE),
    sd     = function(x) sd(x, na.rm=TRUE)
  )

DT[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']

#    cyl mpg_mean   hp_mean mpg_median hp_median   mpg_sd    hp_sd
# 1:   6 19.74286 122.28571       19.7     110.0 1.453567 24.26049
# 2:   4 26.66364  82.63636       26.0      91.0 4.509828 20.93453
# 3:   8 15.10000 209.21429       15.2     192.5 2.560048 50.97689

Answer 2: (score: 1)

Alternatively, you can use mapply. An added benefit is that the same syntax works unchanged with or without by.

> DT[, mapply(my.sum.fun, .SD), .SDcols=c("mpg","hp"), by=list(cyl)]
   cyl       V1   V2       V3        V4    V5       V6
1:   6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
2:   4 26.66364 26.0 4.509828  82.63636  91.0 20.93453
3:   8 15.10000 15.2 2.560048 209.21429 192.5 50.97689

You may also be interested in SIMPLIFY = FALSE, which returns the data.table in long format and keeps the column names:

DT[, mapply(my.sum.fun, .SD, SIMPLIFY = FALSE), .SDcols=c("mpg","hp"), by=list(cyl)]
   cyl      mpg       hp
1:   6 19.74286 122.2857
2:   6     19.7      110
3:   6 1.453567 24.26049
4:   4 26.66364 82.63636
5:   4       26       91
6:   4 4.509828 20.93453
7:   8     15.1 209.2143
8:   8     15.2    192.5
9:   8 2.560048 50.97689
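The two output shapes follow from what mapply itself returns. Below is a small base R sketch for one group. The subset on cyl == 6 and the names grp, simplified and kept are illustrative assumptions, not part of the answer.

```r
# Same summary function as in the question
my.sum.fun <- function(x) list(
  mean   = mean(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE)
)

# One group's worth of the selected columns
grp <- mtcars[mtcars$cyl == 6, c("mpg", "hp")]

# Default SIMPLIFY = TRUE: the per-column lists are collapsed
# into a 3 x 2 matrix whose cells are list elements
simplified <- mapply(my.sum.fun, grp)
dim(simplified)

# SIMPLIFY = FALSE: a named list with one 3-element list per column
kept <- mapply(my.sum.fun, grp, SIMPLIFY = FALSE)
names(kept)
```

The simplified result has six cells, which matches the six columns V1 to V6 in the wide output. The unsimplified result keeps the mpg and hp names with three statistics each, which matches the three rows per group in the long output.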
