我正在尝试通过分组变量将多个函数应用于多个列。我可以获得结果,但是没有有用的格式。在下面,我希望res2可以由by变量“ cyl”作为res1的扩展,并与cyl的唯一值一样多。
我尝试省略unlist并重新定义my.sum.function以返回数字而不是列表。但是我无法获得所需的格式。
library(data.table)
## The well known data
data(mtcars)
DT <- data.table(mtcars)
## a custom set of summary functions
my.sum.fun = function(x){list(
mean = mean(x, na.rm=T),
median = median(x, na.rm=T),
sd = sd(x, na.rm=T)
)}
## I can summarize multiple columns. This works
res1 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp")]
res1
mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
20.090625 19.200000 6.026948 146.687500 123.000000 68.562868
## Now I add a by column. What I would like is the format as res1 but with the by column "cyl" added and with as many rows as unique values of "cyl".
res2 <- DT[,unlist(lapply(.SD,my.sum.fun)),.SDcols=c("mpg","hp"),by=list(cyl)]
res2
cyl V1
1: 6 19.742857
2: 6 19.700000
3: 6 1.453567
4: 6 122.285714
5: 6 110.000000
6: 6 24.260491
7: 4 26.663636
8: 4 26.000000
9: 4 4.509828
10: 4 82.636364
11: 4 91.000000
12: 4 20.934530
13: 8 15.100000
14: 8 15.200000
15: 8 2.560048
16: 8 209.214286
17: 8 192.500000
18: 8 50.976886
答案 0 :(得分:3)
unlist
中有一个选项可以避免递归取消列出-recursive
参数(默认情况下为recursive = TRUE
)
DT[,unlist(lapply(.SD,my.sum.fun),
recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
# cyl mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
#1: 6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
#2: 4 26.66364 26.0 4.509828 82.63636 91.0 20.93453
#3: 8 15.10000 15.2 2.560048 209.21429 192.5 50.97689
答案 1 :(得分:1)
我意识到在data.table中使用dplyr似乎有点愚蠢,但是我不认为summarize_all
会比lapply
慢,这仍然可以让您利用数据表的优势快速分组等。
library(dplyr)
my_funs <- list(
mean = function(x) mean(x, na.rm=T),
median = function(x) median(x, na.rm=T),
sd = function(x) sd(x, na.rm=T)
)
dt[, summarise_all(.SD, my_funs), .SDcols = c("mpg", "hp"), by = 'cyl']
# cyl mpg_mean hp_mean mpg_median hp_median mpg_sd hp_sd
# 1: 6 19.74286 122.28571 19.7 110.0 1.453567 24.26049
# 2: 4 26.66364 82.63636 26.0 91.0 4.509828 20.93453
# 3: 8 15.10000 209.21429 15.2 192.5 2.560048 50.97689
答案 2 :(得分:1)
或者,您可以使用mapply
。这样做的另一个好处是,在使用by
或不使用> DT[, mapply(my.sum.fun, .SD), .SDcols=c("mpg","hp"), by=list(cyl)]
cyl V1 V2 V3 V4 V5 V6
1: 6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
2: 4 26.66364 26.0 4.509828 82.63636 91.0 20.93453
3: 8 15.10000 15.2 2.560048 209.21429 192.5 50.97689
的情况下,无需更改语法即可适用。
SIMPLIFY = FALSE
您可能还对DT[, mapply(my.sum.fun, .SD, SIMPLIFY = FALSE), .SDcols=c("mpg","hp"), by=list(cyl)]
cyl mpg hp
1: 6 19.74286 122.2857
2: 6 19.7 110
3: 6 1.453567 24.26049
4: 4 26.66364 82.63636
5: 4 26 91
6: 4 4.509828 20.93453
7: 8 15.1 209.2143
8: 8 15.2 192.5
9: 8 2.560048 50.97689
感兴趣,它将以长格式返回data.table并保留列名-
module A