duplicate rows of dataframe by factor level index

Date: 2016-02-03 03:04:26

Tags: r

Wonder if there is a smarter and faster way of doing this.

Say I have this dataframe:

library(dplyr)
set.seed(1)
ddf <- data.frame(time=1:20, id=rep(letters[1:5], each=20))
ddf <- ddf %>% group_by(id) %>%  mutate(val1 = rnorm(20), val2 = cumsum(val1)) 

What I want to do is create 20 copies of this dataframe (20 because there are 20 unique time values). However, each copy should drop one more trailing time value: the first copy keeps all rows of ddf, the second keeps all rows except those where ddf$time == 20, the next keeps all rows except where ddf$time == 20 or ddf$time == 19, and so on, until the final copy keeps only the rows where ddf$time == 1.
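As a sanity check on that rule (not part of the original question): a row with time == t survives in 21 - t of the 20 copies, and there are 5 ids per time value, so the stacked result should have 1050 rows.

```r
# Row-count check for the desired result: a row with time == t
# appears in (21 - t) of the 20 copies, and each time value occurs
# once per each of the 5 ids.
times <- 1:20
copies_per_time <- 21 - times      # time 1 kept in all 20 copies, time 20 in just 1
total_rows <- sum(copies_per_time * 5)
total_rows                         # 1050
```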

Here is my solution:

ddfx <- vector("list", length(unique(ddf$time)))
for(i in seq_along(ddfx)){
  ddfx[[i]] <- ddf %>% filter(time <= i)
}

ddfz <- do.call('rbind', Map(cbind, ddfx, ival = 1:length(unique(ddf$time))))

Can it be done faster and more simply?
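(Editor's note, not from the original thread: since the question already uses dplyr, one tidyverse alternative is to compute the copy count per row directly and expand with tidyr::uncount(), avoiding the 20 filter-and-bind passes. This assumes tidyr is available.)

```r
library(dplyr)
library(tidyr)

set.seed(1)
ddf <- data.frame(time = 1:20, id = rep(letters[1:5], each = 20)) %>%
  group_by(id) %>%
  mutate(val1 = rnorm(20), val2 = cumsum(val1))

# A row with time == t belongs to (max(time) - t + 1) of the copies,
# so expand each row that many times instead of binding 20 filtered copies.
ddfz2 <- ddf %>%
  ungroup() %>%
  mutate(copies = max(time) - time + 1) %>%
  uncount(copies)

nrow(ddfz2)   # 1050
```

Note this drops the ival column identifying which copy a row came from; it only reproduces the stacked rows.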

1 answer:

Answer 0 (score: 1)

Turning my comment into an answer, if you use data.table you can do

library(data.table)

setDT(ddf)[order(-time), copies := rleid(time)]
ddf <- ddf[rep(1:.N, copies)][, copies := NULL]
ddf

# time id       val1       val2
# 1:    1  a -0.6264538 -0.6264538
# 2:    1  a -0.6264538 -0.6264538
# 3:    1  a -0.6264538 -0.6264538
# 4:    1  a -0.6264538 -0.6264538
# 5:    1  a -0.6264538 -0.6264538
# ---                              
# 1046:   18  e -0.5732654  4.0950292
# 1047:   18  e -0.5732654  4.0950292
# 1048:   19  e -1.2246126  2.8704166
# 1049:   19  e -1.2246126  2.8704166
# 1050:   20  e -0.4734006  2.3970160

## quick check
table(ddf$time)
#   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
# 100  95  90  85  80  75  70  65  60  55  50  45  40  35  30  25  20  15  10   5

Explanation

Run-Length Encoding (RLE) ?rle (or ?data.table::rleid)

computes the lengths and values of runs of equal values in a vector

Which means it groups equal values that appear in sequence. As the 'copying' required depends on time, we can order by time, which puts equal values next to each other in the data.

rleid then numbers those runs of equal values sequentially from 1.

Because we sorted by decreasing time, the run id of each row is exactly the number of copies we require of it, so we can use it directly as the repetition count.
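A quick illustration of rleid() on a toy vector (assuming data.table is loaded):

```r
library(data.table)

# Each run of equal adjacent values gets the next integer id.
x <- c(20, 20, 19, 19, 19, 18, 1)
rleid(x)   # 1 1 2 2 2 3 4
```

With time sorted in decreasing order, the largest time value lands in run 1 (one copy) and the smallest in the last run (the most copies), matching the counts in the table above.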

Speed Comparison

And as you were after more speed, here is a comparison to your original and the Map solution

fun_orig <- function(x){
    ddfx <- vector("list", length(unique(x$time)))
    for(i in seq_along(ddfx)){
        ddfx[[i]] <- x %>% filter(time <= i)
    }
    do.call('rbind', Map(cbind, ddfx, ival = seq_along(ddfx)))
}

fun_map <- function(x){
    Map(function(d, y) d[d$time <= y, ], list(x), 20:1)
}

fun_dt <- function(x){
    x <- as.data.table(x)   # copy, so the input isn't modified by reference
    x[order(-time), copies := rleid(time)]
    x[rep(1:.N, copies)][, copies := NULL][]
}


library(microbenchmark)

microbenchmark(fun_orig(ddf), fun_map(ddf), fun_dt(ddf))
# Unit: microseconds
#           expr      min       lq     mean   median       uq      max neval cld
#  fun_orig(ddf) 4396.559 4547.975 4883.709 4646.162 4784.530 8002.254   100   c
#   fun_map(ddf) 3341.207 3497.490 3651.714 3588.343 3649.953 6799.140   100  b 
#    fun_dt(ddf)  862.612  955.883 1030.185  998.363 1038.336 3850.275   100 a