R data.table lapply with cut function

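Date: 2017-05-03 16:59:02

Tags: r data.table lapply

I am trying to bin data according to predetermined breakpoints. For a single year I can do this easily with the cut() function, but I am struggling to make it work inside a data.table call. Here is the test data: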

library(data.table)
set.seed(4)
YR = data.table(yr=1962:2015)
ID = data.table(id=10001:11000)
DT <- YR[,as.list(ID), by = yr] # intentional cartesian join
# now add data
DT[,`:=` (ratio = rep(sample(10),each=2700)+rnorm(nrow(DT)))]

This gives three columns: yr, id, and ratio. The last one is the data I want to bin.

Here are the yearly breakpoints:

DTy <- data.table(matrix(rep(1:10,each=nrow(YR)),nrow(YR),10)+rnorm(54)/10)
DTy[,yr := 1962:2015]

So each year from 1962 to 2015 has its own set of cutoffs. Taking 1962 as an example, this is what I want to do:

group <- cut(DT[yr==1962,ratio],DTy[yr==1962, -c("yr")], labels = FALSE)

This is what the result looks like.

ratio <- DT[yr==1962,ratio]
DTy[yr==1962,-c("yr")]
test <- data.table(ratio,group)
test[,yr:=1962]
> test
         ratio group   yr
   1: 6.689275     6 1962
   2: 4.718753     4 1962
   3: 5.786855     5 1962
   4: 7.896540     7 1962
   5: 7.776863     7 1962
  ---                    
 996: 6.176614     6 1962
 997: 7.689046     7 1962
 998: 4.652658     4 1962
 999: 7.075622     7 1962
1000: 5.543791     5 1962

I tried:

# merge two datasets together
newDT <- merge(x = DT, y = DTy, by = c("yr"))
# get names of columns with breakpoints
cnames <- names(newDT)[newDT[,grep("^V", names(newDT))]] 
# apply the cut function by year. 
newDT[,groupt := lapply(.SD, cut, as.vector(unique(.SD[,cnames,with=FALSE])), labels = FALSE, include.lowest = TRUE), by = .(year), .SDcols = c("ratio", cnames)]

But I get this error:

Error in `[.data.table`(newDT, , `:=`(groupt, lapply(.SD, cut, as.vector(unique(.SD[,  : 
  column or expression 1 of 'by' or 'keyby' is type closure. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

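Honestly, I am not sure what this means. I am trying to use .SD the way one normally does with data.table, but different arguments of cut() need different columns, and I am not sure how to pass them through lapply.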
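For context, data.table raises this message when an expression in by evaluates to a function (a "closure") instead of a column. newDT has a column yr but none called year, so year resolves to the function data.table::year. A minimal sketch with toy data (the names toy and v are hypothetical and not part of the data above):

```r
library(data.table)

toy <- data.table(yr = c(1962, 1962, 1963), v = 1:3)  # toy data, not the DT above

# `yr` is a column, so grouping works as expected
ok <- toy[, sum(v), by = .(yr)]

# `year` is not a column; R finds the function data.table::year instead,
# and grouping by a function fails with an error
err <- tryCatch(toy[, sum(v), by = .(year)],
                error = function(e) conditionMessage(e))

is.function(year)  # TRUE: `year` is a function exported by data.table
```

The exact wording of the message may differ between data.table versions.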

2 Answers:

Answer 0 (score: 1)

Here is one approach using a join and by=.EACHI:

DT[DTy, on="yr",
   cutted := cut(ratio, c(i.V1, i.V2, i.V3, i.V4, i.V5, i.V6, i.V7, i.V8, i.V9, i.V10),
                 labels=FALSE), by=.EACHI]

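Here, the data.table holding the breaks is joined to the main data.table by year, and cut is used to assign a new variable with :=. Because of by=.EACHI, this is applied separately to each group of rows matched by a joined value.

This returns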

head(DT)
     yr    id    ratio cutted
1: 1962 10001 6.689275      6
2: 1962 10002 4.718753      4
3: 1962 10003 5.786855      5
4: 1962 10004 7.896540      7
5: 1962 10005 7.776863      7
6: 1962 10006 6.566604      6
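The same join pattern can be seen on a smaller scale. The sketch below uses toy tables with hypothetical names (vals, brks, lo, mid, hi) to show how each row of the breaks table is applied only to its matching rows:

```r
library(data.table)

# Toy data (hypothetical): two groups of values and one row of breaks per group
vals <- data.table(g = c("a", "a", "b", "b"), x = c(1.5, 2.5, 15, 25))
brks <- data.table(g = c("a", "b"), lo = c(1, 10), mid = c(2, 20), hi = c(3, 30))

# For each row of brks, update the matching rows of vals using that row's breaks.
# Columns of the joined table are referenced with the i. prefix.
vals[brks, on = "g",
     bin := cut(x, c(i.lo, i.mid, i.hi), labels = FALSE),
     by = .EACHI]

vals$bin
# -> 1 2 1 2
```

Group "a" is cut at 1, 2, 3 and group "b" at 10, 20, 30, so 1.5 and 15 both land in the first bin of their own group.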

You can use mget together with ls and a pattern search to avoid enumerating the variables used in the breaks argument of cut.

DT[DTy, on="yr", cutted := cut(ratio, c(mget(ls(pattern="^i\\.V\\d+$"))), labels=FALSE),
   by=.EACHI]
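To see what the mget/ls combination does on its own, here is a minimal sketch outside data.table. The function collect_breaks and the values in it are hypothetical. It only stands in for a scope where variables named i.V1, i.V2, ... are visible:

```r
# Hypothetical helper: stands in for a scope in which i.V1, i.V2, ... exist
collect_breaks <- function() {
  i.V1 <- 1.1
  i.V2 <- 2.3
  i.V10 <- 9.8
  unrelated <- 99  # does not match the pattern, so it is ignored
  # ls() lists the names in this function's environment that match the pattern;
  # mget() fetches their values as a named list; unlist() flattens the list
  # into a named numeric vector
  unlist(mget(ls(pattern = "^i\\.V\\d+$")))
}

breaks <- collect_breaks()
names(breaks)
```

ls returns names in sorted order, so i.V10 may come before i.V2. That is harmless here because cut sorts its breaks before using them.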

Answer 1 (score: 1)

This is what I got with a few simple modifications to your code:

newDT <- merge(x = DT, y = DTy, by = c("yr"))
# get names of columns with breakpoints
cnames <- names(newDT)[grep("^V", names(newDT))]

# apply the cut function by year. 
res <- newDT[, group := cut(ratio, unlist(.SD[1]), labels = FALSE),
             by = .(yr), .SDcols = cnames][
                 , .(yr, id, ratio, group)]

#          yr    id     ratio group
#     1: 1962 10001  6.689275     6
#     2: 1962 10002  4.718753     4
#     3: 1962 10003  5.786855     5
#     4: 1962 10004  7.896540     7
#     5: 1962 10005  7.776863     7
#    ---
# 53996: 2015 10996 10.613272    NA
# 53997: 2015 10997 11.260932    NA
# 53998: 2015 10998  8.591909     8
# 53999: 2015 10999  9.143039     9
# 54000: 2015 11000  7.470945     7
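The NA values in the last rows come from how cut treats values outside the range of the breaks. A small sketch with toy values (x and the integer breaks are illustrative, not the yearly breakpoints above):

```r
x <- c(0.5, 1, 5.2, 10.6)

# Default: intervals are (1,2], (2,3], ..., (9,10]; anything outside is NA
a <- cut(x, breaks = 1:10, labels = FALSE)
a
# -> NA NA  5 NA

# include.lowest = TRUE closes the first interval to [1,2], so x == 1 lands in bin 1
b <- cut(x, breaks = 1:10, labels = FALSE, include.lowest = TRUE)
b
# -> NA  1  5 NA

# Assumption: if open-ended outer bins are acceptable, pad the breaks with -Inf and Inf
d <- cut(x, breaks = c(-Inf, 1:10, Inf), labels = FALSE)
d
# ->  1  1  6 11
```

Padding the breaks shifts every bin index up by one, so whether that is appropriate depends on how the group numbers are used afterwards.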