I have searched a lot without success, let alone managed to get this working on my own, so here goes:
I have a data.table:
library(data.table)
DT = data.table(date=rep(c(as.Date("2010-01-01"),as.Date("2010-01-02")), each=3),
  bucket=rep(c("bucket1","bucket2","bucket3"),each=2),
  kbucket=c("(0,.5]","(.5,1]","(1,1.5]","(1.5,2]","(1.5,2]","(2.5,3]"),vol=1:6,o=10:15,m=20:25)
which looks like:
         date  bucket kbucket vol  o  m
1: 2010-01-01 bucket1  (0,.5]   1 10 20
2: 2010-01-01 bucket1  (.5,1]   2 11 21
3: 2010-01-01 bucket2 (1,1.5]   3 12 22
4: 2010-01-02 bucket2 (1.5,2]   4 13 23
5: 2010-01-02 bucket3 (1.5,2]   5 14 24
6: 2010-01-02 bucket3 (2.5,3]   6 15 25
I used ddply (from plyr) on DF, which is an exact copy of DT but stored as a data frame:
out <- ddply(DF,.(date,bucket,kbucket),wrap_summarize)
, where wrap_summarize is defined as:
wrap_summarize = function(x)
{
  # x is the piece of DF for one (date, bucket, kbucket) combination
  summarize( x,
    N = length(x$date),
    sumVol = sum(x$vol),
    sumO = sum(x$o),
    avgM = mean(x$m,na.rm=TRUE))
}
to get
        date  bucket kbucket N sumVol sumO avgM
1 2010-01-01 bucket1  (.5,1] 1      2   11   21
2 2010-01-01 bucket1  (0,.5] 1      1   10   20
3 2010-01-01 bucket2 (1,1.5] 1      3   12   22
4 2010-01-02 bucket2 (1.5,2] 1      4   13   23
5 2010-01-02 bucket3 (1.5,2] 1      5   14   24
6 2010-01-02 bucket3 (2.5,3] 1      6   15   25
which is the desired result.
The real data has this structure but several hundred thousand rows, hence the need for a data.table approach. So I tried this:
test <- DT[,list(N=length(DT$date),sumVol=sum(DT$vol),sumO=sum(DT$o),avgM=mean(DT$m,na.rm=T)),
by=list(date,bucket,kbucket)]
only to get the following, which is clearly not what I want (every group shows the totals for the whole table):
         date  bucket kbucket N sumVol sumO avgM
1: 2010-01-01 bucket1  (0,.5] 6     21   75 22.5
2: 2010-01-01 bucket1  (.5,1] 6     21   75 22.5
3: 2010-01-01 bucket2 (1,1.5] 6     21   75 22.5
4: 2010-01-02 bucket2 (1.5,2] 6     21   75 22.5
5: 2010-01-02 bucket3 (1.5,2] 6     21   75 22.5
6: 2010-01-02 bucket3 (2.5,3] 6     21   75 22.5
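I think I need to use .SD here, but at this point I figured that, rather than risk missing the most efficient solution, it is better to ask and share the problem. Thanks in advance!
Answer 0 (score: 0)
You are looking for: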
DT[,list(
.N,
sumVol=sum(vol),
sumO=sum(o),
avgM=mean(m,na.rm=T)
),by=list(date,bucket,kbucket)]
which gives
#          date  bucket kbucket N sumVol sumO avgM
# 1: 2010-01-01 bucket1  (0,.5] 1      1   10   20
# 2: 2010-01-01 bucket1  (.5,1] 1      2   11   21
# 3: 2010-01-01 bucket2 (1,1.5] 1      3   12   22
# 4: 2010-01-02 bucket2 (1.5,2] 1      4   13   23
# 5: 2010-01-02 bucket3 (1.5,2] 1      5   14   24
# 6: 2010-01-02 bucket3 (2.5,3] 1      6   15   25
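To see why the attempt in the question returned the same totals in every row: inside `j`, a bare column name such as `vol` refers only to the rows of the current `by` group, whereas `DT$vol` reaches back to the full column of the whole table. Below is a minimal, self-contained sketch of the difference, using the toy table from the question. The object names `per_group`, `whole_table` and `sums_sd` are made up for illustration, and the `.SD` variant is only one possible way to write the sums, not necessarily the fastest.

```r
library(data.table)

# Same toy table as in the question
DT <- data.table(
  date    = rep(as.Date(c("2010-01-01", "2010-01-02")), each = 3),
  bucket  = rep(c("bucket1", "bucket2", "bucket3"), each = 2),
  kbucket = c("(0,.5]", "(.5,1]", "(1,1.5]", "(1.5,2]", "(1.5,2]", "(2.5,3]"),
  vol = 1:6, o = 10:15, m = 20:25
)

# Bare column names in j are evaluated once per group
per_group <- DT[, list(N = .N, sumVol = sum(vol), sumO = sum(o),
                       avgM = mean(m, na.rm = TRUE)),
                by = list(date, bucket, kbucket)]

# DT$vol ignores the grouping and always sees all six rows
whole_table <- DT[, list(sumVol = sum(DT$vol)),
                  by = list(date, bucket, kbucket)]

# .SD variant: apply one function to several columns of each group
sums_sd <- DT[, lapply(.SD, sum),
              by = list(date, bucket, kbucket),
              .SDcols = c("vol", "o")]

print(per_group)
print(whole_table)
print(sums_sd)
```

`.SD` is mostly useful when the same function has to be applied to many columns. For a handful of named summaries like the ones here, writing them out in `list(...)` keeps the output column names explicit.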