I have a large data.frame with a structure like this example:
df <- data.frame(id = rep(c("a","b","c"),4), sex = rep(c("M","F"),6), score = 1:12)
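I want to summarise it efficiently by the id column, pasting the unique sex values into a comma-separated string and keeping the maximum score value.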
How can I modify this data.table call to achieve that:
setDT(df)[, lapply(.SD, function(x) paste(unique(x), collapse = ",")), by = list(id)]
Answer 0 (score: 2)
Are you sure you want to use strsplit? How about keeping the sex values as a list instead? Like this:
df[ , .(list(sex), max(score)), by = id]
#    id      V1 V2
# 1:  a M,F,M,F 10
# 2:  b F,M,F,M 11
# 3:  c M,F,M,F 12
(We can of course name the columns however you like.)
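As a minimal sketch of that naming remark on the question's example data, the two variants below give the columns explicit names and apply unique() to sex, as the question asks. The object names res_list and res_str are only illustrative, and stringsAsFactors = FALSE is an assumption added here so that the result is the same on older R versions.

```r
library(data.table)

# Example data from the question; stringsAsFactors = FALSE is an added assumption
df <- data.frame(id = rep(c("a", "b", "c"), 4),
                 sex = rep(c("M", "F"), 6),
                 score = 1:12,
                 stringsAsFactors = FALSE)
setDT(df)  # convert to data.table by reference

# Variant 1: keep the unique sex values as a list column
res_list <- df[ , .(sex = list(unique(sex)), score = max(score)), by = id]

# Variant 2: paste the unique sex values into one comma-separated string
res_str <- df[ , .(sex = paste(unique(sex), collapse = ","),
                   score = max(score)), by = id]

print(res_str)
```

The list column keeps the individual values accessible (for example res_list$sex[[1]]), while the pasted string would have to be split again before the values can be used separately.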
As for timing, I compared list with data.table, paste with data.table, and paste with dplyr; data.table dominates on a dataset of non-trivial size:
library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(102349)
NN <- 1e6
DT <- data.table(id = sample(c("a","b","c"), NN, TRUE),
                 sex = sample(c("M","F"), NN, TRUE),
                 score = sample(12, NN, TRUE))

microbenchmark(times = 1000L,
  mikec = DT[ , .(list(unique(sex)), max(score)), by = id],
  mikec_str = DT[ , .(paste(unique(sex), collapse = ","),
                      score = max(score)), by = id],
  count = DT %>% group_by(id) %>%
    summarise(score = max(score),
              sex = paste(unique(sex), collapse = ",")))
# Unit: milliseconds
#       expr      min       lq     mean   median       uq      max neval cld
#      mikec 20.31309 20.73779 30.47556 21.95649 35.02822 241.6299  1000   a
#  mikec_str 20.34941 20.76544 32.05443 22.40155 35.32093 325.3754  1000   a
#      count 27.20780 29.11735 47.38582 42.93207 44.54086 334.8008  1000   b
Answer 1 (score: 0)
You can try:
require(dplyr)
df %>% group_by(id) %>% summarise(score = max(score), sex = paste(unique(sex),collapse=","))
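For reference, here is a small sketch of what that pipeline returns on the question's example data. It assumes a plain data.frame input (built with stringsAsFactors = FALSE, an assumption added here); the name res_dplyr is only illustrative.

```r
library(dplyr)

# Example data from the question, kept as a plain data.frame
df <- data.frame(id = rep(c("a", "b", "c"), 4),
                 sex = rep(c("M", "F"), 6),
                 score = 1:12,
                 stringsAsFactors = FALSE)

# One row per id: maximum score and the unique sex values pasted together
res_dplyr <- df %>%
  group_by(id) %>%
  summarise(score = max(score),
            sex = paste(unique(sex), collapse = ","))

print(res_dplyr)
```

The result is a tibble with one row per id, with the columns in the order given inside summarise() (score before sex).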