Further subsetting a data frame for embarrassingly parallel processing

Asked: 2013-05-07 15:23:35

Tags: r parallel-processing

I have an embarrassingly parallel problem that I'm handling with the snowfall package and its sfLapply function. It works well, except that I need a better way of breaking the problem apart. My incoming data frame looks like this:

Group          Date
1            02/01/12
4            02/01/12
...          ...(31 items)
13           02/01/13
4            02/18/13
5            02/18/13
...          ...(9 items)
22           02/18/13

and it needs to be split into processing groups by date. The trouble is that there are only about 5 distinct dates, so just using

split(processing.groups, processing.groups$date)

yields too few parallel jobs. What I'd like is an elegant way to get a list in which each element contains no more than 20 entries, but every element is guaranteed to hold rows of a single date only.
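To illustrate the problem with the plain split, here is a toy stand-in for my data (column names and sizes are made up): a split on date alone gives only as many list elements as there are distinct dates.

```r
# hypothetical stand-in for the real data frame
processing.groups <- data.frame(Group = 1:45,
                                date  = rep(c("02/01/12", "02/18/13"), c(34, 11)))

# one list element per distinct date -- here just 2, far too few jobs
length(split(processing.groups, processing.groups$date))
```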

Example:

List Elem 1:  20 items
1             02/01/12
4             02/01/12
...           ...
9             02/01/12
List Elem 2:  14 items
99            02/01/12
17            02/01/12
...           ...
13            02/01/12
List Elem 3:  11 items
4             02/18/13
5             02/18/13
...           ...
22            02/18/13

It feels like some tricky listy-cutty-splitty syntax ought to accomplish this nicely. Any suggestions?

2 answers:

Answer 0 (score: 1)

I'm not sure whether this counts as elegant, but...

# just to setup a dummy dataframe
z <- data.frame(group=1:200, date=sample(c("a","b","c","d"),200,replace=TRUE))

splitz <- split(z, z$date) # split it once
newsplit <- list() # create something to dump the results into
# split the already split stuff into chunks of <= 20
twicesplit <- sapply(splitz, FUN= function(x){
    newsplit <<- c(newsplit,split(x, findInterval(1:dim(x)[1],(1:20*20))) )
    # the `*20` here would have to be longer if you had more than 400 observations with same date
})
rm(twicesplit) # cleanup unnecessary variable used to suppress printing
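The same two-stage split can be written without the `<<-` global assignment: let lapply return the nested list and flatten it one level with unlist. A self-contained sketch of that variant (the names `z` and `chunks` are illustrative, not from the answer above):

```r
# dummy data, as in the answer above
z <- data.frame(group = 1:200,
                date  = sample(c("a", "b", "c", "d"), 200, replace = TRUE))

# split by date, then cut each date's rows into pieces of at most 20,
# and flatten the list-of-lists into one flat list of data frames
chunks <- unlist(lapply(split(z, z$date), function(x)
    split(x, findInterval(seq_len(nrow(x)), (1:20) * 20))),
  recursive = FALSE)

# every chunk holds at most 20 rows, all sharing a single date
stopifnot(all(sapply(chunks, nrow) <= 20))
stopifnot(all(sapply(chunks, function(x) length(unique(x$date))) == 1))
```

As with the original, the `(1:20) * 20` break points cover up to 400 rows per date and would need to be extended beyond that.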

Answer 1 (score: 1)

Here's one approach:

mydf <- data.frame( Group= sample(45, 45), 
  Date = rep( c('02/01/12', '02/18/13'), c(34, 11) ) )

tmp <- ave( mydf$Group, mydf$Date, 
    FUN=function(x) rep( seq( ceiling(length(x)/20) ),
    each=20, length.out=length(x) ) )

outlist <- split( mydf, interaction(tmp, mydf$Date, drop=TRUE) )
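Running the snippet above and tabulating chunk sizes shows it reproduces the layout asked for in the question: the 34 rows of `02/01/12` become chunks of 20 and 14, and the 11 rows of `02/18/13` stay together.

```r
mydf <- data.frame(Group = sample(45, 45),
                   Date  = rep(c('02/01/12', '02/18/13'), c(34, 11)))

# per row, a within-date chunk number: 1 for the first 20 rows of a
# date, 2 for the next 20, and so on
tmp <- ave(mydf$Group, mydf$Date,
           FUN = function(x) rep(seq(ceiling(length(x) / 20)),
                                 each = 20, length.out = length(x)))

outlist <- split(mydf, interaction(tmp, mydf$Date, drop = TRUE))

sapply(outlist, nrow)  # chunk sizes: 20, 14, 11
```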