Aggregating data.frame rows with data.table using multiple collapsing functions

Date: 2016-02-26 22:31:00

Tags: r dataframe data.table aggregate

I have a large data.frame with this example structure:

df <- data.frame(id = rep(c("a","b","c"),4), sex = rep(c("M","F"),6), score = 1:12)

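I would like to aggregate it efficiently by the id column, pasting the unique sex values into a comma-separated string and keeping the maximum score value.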

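How can I modify this data.table call to achieve that? As written, it applies the same paste-unique function to every column, so score is collapsed into a string as well: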

setDT(df)[, lapply(.SD, function(x) paste(unique(x), collapse = ",")), by = list(id)]

2 answers:

Answer 0 (score: 2)

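Are you sure you want a pasted string that you would need strsplit to take apart again? How about keeping the sex values as a list instead? Like this: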

df[ , .(list(sex), max(score)), by = id]
#    id      V1 V2
# 1:  a M,F,M,F 10
# 2:  b F,M,F,M 11
# 3:  c M,F,M,F 12

(We can of course name the columns however you like.)
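As a minimal sketch of what naming the columns could look like on the example data from the question: the result names `res_str` and `res_list` are hypothetical, and `unique()` is added here so that each sex value appears once per id, as the question asks.

```r
library(data.table)

df <- data.frame(id = rep(c("a", "b", "c"), 4),
                 sex = rep(c("M", "F"), 6),
                 score = 1:12)
setDT(df)  # convert to a data.table by reference

# Variant 1: unique sex values pasted into one comma-separated string
res_str <- df[, .(sex = paste(unique(sex), collapse = ","),
                  score = max(score)), by = id]

# Variant 2: unique sex values kept as a list column
res_list <- df[, .(sex = list(unique(sex)),
                   score = max(score)), by = id]

# The list column can still be collapsed to strings later if needed
vapply(res_list$sex, paste, character(1), collapse = ",")
```

Keeping the list column defers the decision about the separator, while the pasted variant gives the string output requested in the question directly.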

As for timing, the two data.table approaches (list and paste) perform about the same, and both beat dplyr on a data set of non-trivial size:

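library(data.table)
library(dplyr)            # provides %>%, group_by and summarise
library(microbenchmark)

# One million rows, three groups
set.seed(102349)
NN <- 1e6
DT <- data.table(id = sample(c("a","b","c"), NN, TRUE),
                 sex = sample(c("M","F"), NN, TRUE),
                 score = sample(12, NN, TRUE))

microbenchmark(times = 1000L,
               # data.table, unique values kept as a list column
               mikec = DT[ , .(list(unique(sex)), max(score)), by = id],
               # data.table, unique values pasted into a string
               mikec_str = DT[ , .(paste(unique(sex), collapse = ","),
                                   score = max(score)), by = id],
               # dplyr equivalent
               count = DT %>% group_by(id) %>% 
                 summarise(score = max(score), 
                           sex = paste(unique(sex),collapse=",")))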
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#      mikec 20.31309 20.73779 30.47556 21.95649 35.02822 241.6299  1000  a 
#  mikec_str 20.34941 20.76544 32.05443 22.40155 35.32093 325.3754  1000  a 
#      count 27.20780 29.11735 47.38582 42.93207 44.54086 334.8008  1000   b
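
Before reading much into timings like these, it may be worth confirming that the compared approaches return the same values. The following is only a sketch on the small example data from the question; the names `dt_res` and `dp_res` are hypothetical, and the comparison is done column by column because the two packages return different classes (a data.table and a tibble).

```r
library(data.table)
library(dplyr)

df <- data.frame(id = rep(c("a", "b", "c"), 4),
                 sex = rep(c("M", "F"), 6),
                 score = 1:12,
                 stringsAsFactors = FALSE)

# data.table version: pasted unique sex values and maximum score per id
dt_res <- as.data.table(df)[, .(sex = paste(unique(sex), collapse = ","),
                                score = max(score)), by = id]

# dplyr version of the same aggregation
dp_res <- df %>%
  group_by(id) %>%
  summarise(score = max(score),
            sex = paste(unique(sex), collapse = ","))

# Compare the aggregated columns directly
identical(dt_res$sex, dp_res$sex)
identical(dt_res$score, dp_res$score)
```

Note that data.table keeps groups in order of first appearance while dplyr sorts them, so on data where those orders differ the rows would need to be aligned by id before comparing.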

Answer 1 (score: 0)

You could try:

require(dplyr)
df %>% group_by(id) %>% summarise(score = max(score), sex = paste(unique(sex),collapse=","))