I wonder if there is a smarter and faster way of doing this.
Say I have this dataframe:
library(dplyr)
set.seed(1)
ddf <- data.frame(time=1:20, id=rep(letters[1:5], each=20))
ddf <- ddf %>% group_by(id) %>% mutate(val1 = rnorm(20), val2 = cumsum(val1))
What I want to do is create 20 copies of this dataframe (20 because there are 20 unique time values). For each copy, however, I want to drop the current largest time value: the first copy keeps all rows of ddf, the second keeps all rows except those where ddf$time == 20, the next drops both ddf$time == 20 and ddf$time == 19, and so on, until the final copy keeps only the rows where ddf$time == 1.
Here is my solution:
ddfx <- vector("list", length(unique(ddf$time)))
for (i in seq_along(ddfx)) {
  ddfx[[i]] <- ddf %>% filter(time <= i)
}
ddfz <- do.call('rbind', Map(cbind, ddfx, ival = 1:length(unique(ddf$time))))
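For reference, the loop and the Map/cbind step can be collapsed into a single expression. A minimal sketch using dplyr's bind_rows (the ival column name is kept from the version above, though bind_rows stores the .id column as character, not integer):

```r
library(dplyr)

set.seed(1)
ddf <- data.frame(time = 1:20, id = rep(letters[1:5], each = 20))
ddf <- ddf %>% group_by(id) %>% mutate(val1 = rnorm(20), val2 = cumsum(val1))

# one filtered copy per cutoff i, stacked with an 'ival' id column
ddfz2 <- bind_rows(
  lapply(seq_len(20), function(i) filter(ungroup(ddf), time <= i)),
  .id = "ival"
)
nrow(ddfz2)  # 1050 rows: 5 ids x (1 + 2 + ... + 20) retained times
```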
Can it be done faster and more simply?
Answer 0 (score: 1)
Turning my comment into an answer: if you use data.table you can do
setDT(ddf)[order(-time) , copies := rleid(time) ]
ddf <- ddf[rep(1:.N, copies)][, copies:=NULL]
ddf
# time id val1 val2
# 1: 1 a -0.6264538 -0.6264538
# 2: 1 a -0.6264538 -0.6264538
# 3: 1 a -0.6264538 -0.6264538
# 4: 1 a -0.6264538 -0.6264538
# 5: 1 a -0.6264538 -0.6264538
# ---
# 1046: 18 e -0.5732654 4.0950292
# 1047: 18 e -0.5732654 4.0950292
# 1048: 19 e -1.2246126 2.8704166
# 1049: 19 e -1.2246126 2.8704166
# 1050: 20 e -0.4734006 2.3970160
## quick check
table(ddf$time)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 100 95 90 85 80 75 70 65 60 55 50 45 40 35 30 25 20 15 10 5
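Those counts match what we expect analytically: a row with time value t survives in every copy whose cutoff is at least t, i.e. in 21 - t of the 20 copies, and there are 5 ids per time. A quick cross-check of that arithmetic:

```r
# expected count per time t: 5 ids x (21 - t) copies
expected <- 5 * (21 - 1:20)
expected  # 100 95 90 ... 5, matching table(ddf$time) above
```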
Explanation
Run-length encoding (RLE; see ?rle, or ?data.table::rleid) computes the lengths and values of runs of equal values in a vector. In other words, it groups equal values that appear in sequence. Since the 'copying' required depends on time, we order by time in decreasing order, which puts identical time values next to each other in the data. rleid then numbers those runs sequentially from 1, and we can use the run ids as the number of copies required of each group: the largest time gets run id 1 (one copy), the next gets 2, and so on, which rep(1:.N, copies) then materialises.
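A toy example may make the mechanics concrete (a hypothetical three-row table; with distinct time values sorted in decreasing order, the run ids are simply 1, 2, 3, which are exactly the copy counts we need):

```r
library(data.table)

dt <- data.table(time = c(3, 2, 1), id = c("a", "b", "c"))
dt[order(-time), copies := rleid(time)]  # time 3 -> 1, time 2 -> 2, time 1 -> 3
dt[rep(1:.N, copies)]  # the row with time 1 appears 3 times, time 2 twice, time 3 once
```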
Speed Comparison
And as you were after more speed, here is a comparison with your original approach and a Map solution:
fun_orig <- function(x){
  ddfx <- vector("list", length(unique(x$time)))
  for (i in seq_along(ddfx)) ddfx[[i]] <- x %>% filter(time <= i)
  do.call('rbind', Map(cbind, ddfx, ival = seq_along(ddfx)))
}
fun_map <- function(x){
  # returns a list of 20 filtered copies (largest cutoff first)
  Map(function(d, y) d[d$time <= y, ], list(x), 20:1)
}
fun_dt <- function(x){
  dt <- as.data.table(x)  # copy, so the input is not modified by reference
  dt[order(-time), copies := rleid(time)]
  dt[rep(1:.N, copies)][, copies := NULL][]
}
library(microbenchmark)
microbenchmark(fun_orig(ddf), fun_map(ddf), fun_dt(ddf))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# fun_orig(ddf) 4396.559 4547.975 4883.709 4646.162 4784.530 8002.254 100 c
# fun_map(ddf) 3341.207 3497.490 3651.714 3588.343 3649.953 6799.140 100 b
# fun_dt(ddf) 862.612 955.883 1030.185 998.363 1038.336 3850.275 100 a