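I have an embarrassingly parallel problem that I'm handling with the snowfall package and its sfLapply function. It works well, except that I need a better way to break the problem up. My incoming data frame looks like this: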
Group Date
1 02/01/12
4 02/01/12
... ...(31 items)
13 02/01/13
4 02/18/13
5 02/18/13
... ...(9 items)
22 02/18/13
and it needs to be split into processing groups by date. The trouble is that there are only about 5 distinct dates, so simply using
split(processing.groups, processing.groups$date)
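results in too few parallel jobs. What I'd like is an elegant way to get a list in which each element contains no more than 20 entries, all guaranteed to share the same date.
Example: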
List Elem 1: 20 items
1 02/01/12
4 02/01/12
... ...
9 02/01/12
List Elem 2: 14 items
99 02/01/12
17 02/01/12
... ...
13 02/01/12
List Elem 3: 11 items
4 02/18/13
5 02/18/13
... ...
22 02/18/13
It feels like some tricky listy cutty splitty syntax should be able to do this nicely. Any suggestions?
Answer 0: (score: 1)
I'm not sure this is elegant, but...
# set up a dummy data frame
z <- data.frame(group = 1:200, date = sample(c("a", "b", "c", "d"), 200, replace = TRUE))
splitz <- split(z, z$date)  # split once, by date
newsplit <- list()          # accumulates the final chunks
# split each per-date piece again into chunks of <= 20 rows
twicesplit <- sapply(splitz, FUN = function(x) {
  newsplit <<- c(newsplit, split(x, findInterval(1:dim(x)[1], (1:20 * 20))))
  # the breakpoints `1:20 * 20` only reach 400, so `1:20` would have to be
  # longer if more than about 400 observations shared the same date
})
rm(twicesplit)  # clean up the variable that was only used to suppress printing
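For comparison, the same two-stage split can be sketched without the global assignment or the fixed set of breakpoints. This is only an illustrative variant, not the answer's own code: `chunk_size`, `by_date`, `nested`, and `chunks` are made-up names, and the seed is set purely so the sketch is reproducible.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility of the sketch
z <- data.frame(group = 1:200,
                date = sample(c("a", "b", "c", "d"), 200, replace = TRUE))

chunk_size <- 20  # hypothetical name for the per-chunk row limit

# First split by date, then cut each date's rows into runs of at most
# chunk_size rows: row i of a piece goes to chunk ceiling(i / chunk_size).
by_date <- split(z, z$date)
nested <- lapply(by_date, function(x) {
  split(x, ceiling(seq_len(nrow(x)) / chunk_size))
})

# Flatten the list of lists by one level into a single list of data frames.
chunks <- unlist(nested, recursive = FALSE)
```

Because the chunk index is computed from the row count of each piece, this version should not need adjusting when a single date has many rows.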
Answer 1: (score: 1)
Here is one way:
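# example data: 34 rows for one date, 11 for the other
mydf <- data.frame( Group = sample(45, 45),
                    Date = rep( c('02/01/12', '02/18/13'), c(34, 11) ) )
# within each date, number the rows 1,1,...,2,2,... in runs of at most 20
tmp <- ave( mydf$Group, mydf$Date,
            FUN = function(x) rep( seq( ceiling(length(x)/20) ),
                                   each = 20, length.out = length(x) ) )
# split on the (chunk number, date) combinations that actually occur
outlist <- split( mydf, interaction(tmp, mydf$Date, drop = TRUE) )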
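A quick way to check that a result like `outlist` satisfies both constraints, and to see how it would be handed to a worker, is sketched below. The checks and the `process_chunk` function are illustrative assumptions, not part of the answer; plain `lapply` stands in for the parallel call so the sketch runs without a cluster.

```r
# Rebuild the example so the sketch is self-contained.
mydf <- data.frame(Group = sample(45, 45),
                   Date = rep(c("02/01/12", "02/18/13"), c(34, 11)))
tmp <- ave(mydf$Group, mydf$Date,
           FUN = function(x) rep(seq(ceiling(length(x) / 20)),
                                 each = 20, length.out = length(x)))
outlist <- split(mydf, interaction(tmp, mydf$Date, drop = TRUE))

# Constraint 1: no chunk has more than 20 rows.
sizes <- sapply(outlist, nrow)
stopifnot(all(sizes <= 20))

# Constraint 2: every chunk contains exactly one date.
one_date <- sapply(outlist, function(d) length(unique(d$Date)) == 1)
stopifnot(all(one_date))

# Hypothetical worker: here it just counts rows in a chunk.
process_chunk <- function(d) nrow(d)

# Serial stand-in for the parallel step.
results <- lapply(outlist, process_chunk)
```

With snowfall, after the cluster has been started with `sfInit`, `sfLapply(outlist, process_chunk)` would take the place of the `lapply` call, since it accepts a list and a function in the same way.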