Background:
Suppose we have a dataset like this:
ID DRIVE_NUM FLAG
1 A PASS
2 A FAIL
3 A PASS
-----------------
4 B PASS
5 B PASS
6 B PASS
-----------------
7 C PASS
8 C FAIL
9 C FAIL
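I want to aggregate the dataset by DRIVE_NUM according to the following rules:
For a given DRIVE_NUM group,
if the group contains any FAIL flag, I want the first row with the FAIL flag.
If the group contains no FAIL flag, just take the first row of the group.
So I would end up with the following: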
ID DRIVE_NUM FLAG
2 A FAIL
4 B PASS
8 C FAIL
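To make the rule concrete, here is a minimal base R sketch applied to the example table above. The helper name `pick_row` is hypothetical and only used for illustration; this is one straightforward way to express the rule, not a tuned solution.

```r
# Example data from the question
df <- data.frame(
  ID = 1:9,
  DRIVE_NUM = rep(c("A", "B", "C"), each = 3),
  FLAG = c("PASS", "FAIL", "PASS",
           "PASS", "PASS", "PASS",
           "PASS", "FAIL", "FAIL"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first FAIL row if the group has one, else the first row
pick_row <- function(g) {
  fail_idx <- which(g$FLAG == "FAIL")
  if (length(fail_idx) > 0) g[fail_idx[1], ] else g[1, ]
}

# Split by DRIVE_NUM, apply the rule to each piece, and stack the results
result <- do.call(rbind, lapply(split(df, df$DRIVE_NUM), pick_row))
rownames(result) <- NULL
print(result)
```

This selects the rows with ID 2, 4 and 8, matching the expected output. Splitting into one data frame per group is easy to read but tends to be slow when there are many groups, which is what the timings below are about.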
Update
It seems the dplyr solution is even slower than plyr. Am I using something incorrectly?
#Simulate Data
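X = data.frame(
  group = rep(paste0("NO",1:10000),each=2),
  flag = sample(c("F","P"),20000,replace = TRUE),
  var = rnorm(20000),
  # which.min(flag) below needs flag to be a factor; since R 4.0.0
  # data.frame() no longer converts strings to factors by default
  stringsAsFactors = TRUE
)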
library(plyr)
library(dplyr)
#plyr
START = proc.time()
X2 = ddply(X, .(group), function(df) {
  # Apply the rule within each group
  if (any(df$flag == "F")) {
    R = df[df$flag == "F", ][1, ]  # first row flagged "F"
  } else {
    R = df[1, ]                    # no "F" in the group: take its first row
  }
  R
})
proc.time() - START
#user system elapsed
#0.03 0.00 0.03
#dplyr method 1
START = proc.time()
X %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START
#user system elapsed
#0.22 0.02 0.23
#dplyr method 2
START = proc.time()
X %>%
group_by(group, flag) %>%
slice(1) %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START
#user system elapsed
#0.28 0.00 0.28
Is there a data.table version that would be much faster than plyr?
Answer 0 (score: 6)
Using data.table
library(data.table)
START = proc.time()
DT = as.data.table(X)
# .I holds row numbers; keep the row number of the first minimum flag per group
X3 = DT[DT[, .I[which.min(flag)], by = group]$V1]
proc.time() - START
# user system elapsed
# 0.00 0.02 0.02
Or using order
START = proc.time()
X4 = as.data.table(X)[order(flag), .SD[1L] , by = group]
proc.time() - START
# user system elapsed
# 0.02 0.00 0.01
The corresponding timings for the OP's dplyr and plyr code are
# user system elapsed
# 0.28 0.04 2.68
# user system elapsed
# 0.01 0.06 0.67
As @Frank also commented, the timing for a base R approach is
START = proc.time()
Z = X[order(X$flag),]
X5 = with(Z, Z[tapply(seq_len(nrow(Z)), group, head, 1), ])
proc.time() - START
# user system elapsed
# 0.15 0.03 0.65
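The sort-then-take-first idea can be illustrated on the question's small example. The sketch below is base R and swaps `tapply` for `duplicated()`, which is an alternative I am substituting here, not the approach timed above. It also sorts on `FLAG != "FAIL"` so the result does not depend on "FAIL" happening to come before "PASS" alphabetically.

```r
# Example data from the question
df <- data.frame(
  ID = 1:9,
  DRIVE_NUM = rep(c("A", "B", "C"), each = 3),
  FLAG = c("PASS", "FAIL", "PASS",
           "PASS", "PASS", "PASS",
           "PASS", "FAIL", "FAIL"),
  stringsAsFactors = FALSE
)

# Put FAIL rows first (FALSE sorts before TRUE); order() leaves ties
# in their original order, so the earliest FAIL row stays on top
Z <- df[order(df$FLAG != "FAIL"), ]

# Keep the first row encountered for each DRIVE_NUM
res <- Z[!duplicated(Z$DRIVE_NUM), ]

# Restore the original row order for display
res <- res[order(res$ID), ]
print(res)
```

This again selects the rows with ID 2, 4 and 8. No timing is claimed for this variant.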
My guess is that slice is what slows dplyr down.
Answer 1 (score: 3)
Well, this is not faster than data.table, but it is definitely an improvement:
START = proc.time()
m3 <- X %>%
group_by(group) %>%
arrange(flag) %>%
slice(1)
proc.time() - START
#user system elapsed
#0.03 0.00 0.05
# OP - method 1
START = proc.time()
m1 <- X %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START
#user system elapsed
#0.31 0.00 0.33
# OP - method 2
START = proc.time()
m2 <- X %>%
group_by(group, flag) %>%
slice(1) %>%
group_by(group) %>%
slice(which.min(flag))
proc.time() - START
#user system elapsed
#0.39 0.02 0.45
identical(m2, m3)
# [1] TRUE
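`identical()` shows that two methods agree with each other, but not that either one follows the rule from the question. A rough sketch of a direct check is given below. The function name `check_rule`, the fixed seed, and the smaller data size are assumptions made for illustration. The check only verifies that each group appears exactly once and that the chosen flag is "F" exactly when the group contains an "F"; it does not verify which of several "F" rows was chosen.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible
X <- data.frame(
  group = rep(paste0("NO", 1:1000), each = 2),
  flag = sample(c("F", "P"), 2000, replace = TRUE),
  var = rnorm(2000),
  stringsAsFactors = FALSE
)

# Hypothetical checker: does `picked` hold one row per group, flagged "F"
# exactly when the group contains at least one "F"?
check_rule <- function(data, picked) {
  has_fail <- tapply(data$flag == "F", data$group, any)
  one_per_group <- !anyDuplicated(picked$group) &&
    setequal(picked$group, names(has_fail))
  flags_ok <- all((picked$flag == "F") == has_fail[as.character(picked$group)])
  one_per_group && flags_ok
}

# One candidate result: sort "F" rows first, then keep the first row per group
Z <- X[order(X$flag != "F"), ]
picked <- Z[!duplicated(Z$group), ]

check_rule(X, picked)
```

The same function can be applied to `m1`, `m2`, `m3` or the data.table results, since it only needs the `group` and `flag` columns.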