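I am trying to convert SQL code to R. However, the data has about 35 million records with 200 columns each, so the best option I could find is the data.table package.
Here is the problem. In SQL, I can do something like this: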
select order_date, sum(case when item in ("D","C","B") then col4 end) as col1,
sum(case when item not in ("Z","X","Y") then col4 end) as col2
from datatable
where col3 <25
group by order_date;
The query above lets me group by each date. I have not been able to replicate it in data.table. My attempt is below.
grp1<- c("D","C","B")
grp2<- c("Z","X","Y")
d1 <- dat[item %in% grp1, .(col1 = sum(col4, na.rm = TRUE)), by = Order_Date]
d2 <- dat[item %in% grp2, .(col2 = sum(col4, na.rm = TRUE)), by = Order_Date]
d3 <- data.table(d1,d2)
Now, because it subsets the rows before grouping, the groups in d1 and d2 differ.
Answer 0: (score: 7)
You can try the following:
DT[col3 < 25,
.(col1 = sum(col4[item %in% c("D","C","B")]),
col2 = sum(col4[!item %in% c("Z","X","Y")])),
by = .(order_date)]
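As a minimal, self-contained sketch of the same idea: the filter on col3 goes in i, and the item conditions move inside j, so every date that survives the filter stays in the result. The toy values below are hypothetical and only reuse the column names from the question. Adding na.rm = TRUE is my assumption about the desired behaviour: it mirrors SQL's SUM, which skips NULLs. One difference remains: for a group with no matching rows, sum() in R gives 0, whereas SQL's SUM would give NULL.

```r
library(data.table)

# Toy data (hypothetical values); column names follow the question.
DT <- data.table(
  order_date = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-02"),
  item       = c("D", "M", "Z", "Z"),
  col4       = c(1, NA, 5, 7),
  col3       = c(10, 10, 10, 10)
)

# Filter rows once in i, then compute both conditional sums per date in j.
res <- DT[col3 < 25,
          .(col1 = sum(col4[item %in% c("D", "C", "B")], na.rm = TRUE),
            col2 = sum(col4[!item %in% c("Z", "X", "Y")], na.rm = TRUE)),
          by = .(order_date)]
print(res)
```

Here 2000-01-02 contains only a "Z" row, yet it still appears in the result with both sums equal to 0, which is what the two separately subsetted tables could not guarantee.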
Answer 1: (score: 0)
> d <- "
+ order_date,item,col4,col3
+ 2000-01-01,D,1,10
+ 2000-01-01,C,1,10
+ 2000-01-01,M,1,10
+ 2000-01-01,N,1,50
+ 2000-01-01,Z,1,10
+ 2000-01-01,X,1,10
+ 2001-01-02,Z,1,0
+ 2001-01-02,X,1,50"
>
> df = read.csv(textConnection(d))
>
> # data.frame + plyr approach
>
> require(plyr)
Loading required package: plyr
> ddply(
+ df[df$col3<25,],
+ .(order_date),
+ summarize,
+ col1=sum(col4[item %in% c("D","C","B")]),
+ col2=sum(col4[!item %in% c("Z","X","Y")])
+ )
order_date col1 col2
1 2000-01-01 2 3
2 2001-01-02 0 0
>
> # data.table approach, thanks to jangorecki
>
> require(data.table)
Loading required package: data.table
> dt = data.table(df)
>
> dt[col3 < 25,
+ .(col1 = sum(col4[item %in% c("D","C","B")]),
+ col2 = sum(col4[!item %in% c("Z","X","Y")])),
+ by = .(order_date)]
order_date col1 col2
1: 2000-01-01 2 3
2: 2001-01-02 0 0
>