Expanding a data.table by date range, handling the case where one of the dates is NA

Time: 2018-09-13 23:45:11

Tags: r date data.table

Consider the data.table dt:

    id boro block       date   end_date
 1:  1    1     1 01/01/1991 01/01/1992
 2:  1    1     2 01/01/1991 01/01/1992
 3:  1    2     3 01/01/1991 01/01/1992
 4:  1    2     4 01/01/1991         NA
 5:  2    1     1 01/01/1992 01/01/1993
 6:  2    1     2 01/01/1992 01/01/1993
 7:  2    2     3 01/01/1992         NA
 8:  2    2     5 01/01/1992         NA
 9:  3    1     1 01/01/1993         NA
10:  3    1     2 01/01/1993         NA
11:  3    2     6 01/01/1993         NA
12:  3    2     7 01/01/1993         NA

where str(dt) gives:

Classes ‘data.table’ and 'data.frame':  12 obs. of  5 variables:
 $ id      : num  1 1 1 1 2 2 2 2 3 3 ...
 $ boro    : num  1 1 2 2 1 1 2 2 1 1 ...
 $ block   : num  1 2 3 4 1 2 3 5 1 2 ...
 $ date    : Date, format: "1991-01-01" "1991-01-01" "1991-01-01" "1991-01-01" ...
 $ end_date: Date, format: "1992-01-01" "1992-01-01" "1992-01-01" NA ...
 - attr(*, ".internal.selfref")=<externalptr>

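I am trying to expand the rows over the date range given by date and end_date, one row per quarter. That is, for the first row, I want to expand it to: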

     id boro block        qtr
 1:    1    1     1 1991-01-01
 2:    1    1     1 1991-04-01
 3:    1    1     1 1991-07-01
 4:    1    1     1 1991-10-01

If end_date is NA, I want to return a single row containing the fields id, boro, block, and the quarter corresponding to date. That is, for row 4, return:

     id boro block        qtr
 1:    1    2    4 1991-01-01

Following the suggestions in a similar question asked here, I tried:

dt[,.(id,boro,block,qtr = seq(date, end_date, by = "quarter")),by = 1:nrow(dt)]

But I receive the following output:

Error in seq.int(r1$mon, 12 * (to0$year - r1$year) + to0$mon, by) : 'to' must be a finite number

To address the fact that end_date can be NA, I tried:

dt[,ifelse(!(is.na(end_date)),
               .(id,boro,block,qtr = seq(date, end_date, by = "quarter")),
               .(id,boro,block,qtr = seq(date,date, by = "quarter"))),
       by = 1:nrow(dt)]

But for some unknown reason, this does not produce the desired output.

Note: my actual data has 19 million rows and 70 columns, so efficiency matters, which is why I am using data.table.
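The two failures described above can be reproduced in isolation with base R. The sketch below is an illustration and not part of the original question: the variable names are made up, and the ifelse() behavior shown is only a plausible explanation for the unexpected result, since ifelse() returns a value shaped like its test rather than the whole chosen branch.

```r
start <- as.Date("1991-01-01")

# 1) seq() cannot build a sequence towards a missing end date, so it errors.
seq_result <- tryCatch(
  seq(start, as.Date(NA), by = "quarter"),
  error = function(e) e
)
inherits(seq_result, "error")

# 2) ifelse() shapes its result like `test` (length 1 here), so only the
#    first element of the list passed as `yes` survives.
ifelse_result <- ifelse(TRUE, list(id = 1, qtr = 1:4), list(id = 1, qtr = 1L))
length(ifelse_result)

# 3) A plain if/else returns the chosen branch unchanged.
end <- as.Date(NA)
if_result <- if (is.na(end)) start else seq(start, end, by = "quarter")
if_result
```

For a scalar condition evaluated once per row, if/else is therefore usually the safer construct than ifelse().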

2 answers:

Answer 0 (score: 2)


Answer 1 (score: 1)

Here is one possible approach using a data.table non-equi join:

dtcols <- c("date", "end_date")
dt[, (dtcols) := lapply(.SD, as.Date, format="%m/%d/%Y"), .SDcols=dtcols]

#create the quarters
quarters <- dt[,.(qtr=seq(min(date), max(end_date, na.rm=TRUE), by="quarter"))]

#perform non-equi join and then handle NA end_date
quarters[dt, .(id, boro, block, x.qtr, i.date, i.end_date), 
    by=.EACHI, on=.(qtr>=date, qtr<end_date)][,
        .(id, boro, block, 
            qtr=as.Date(ifelse(is.na(i.end_date), i.date, x.qtr), origin="1970-01-01"))]

Output:

    id boro block        qtr
 1:  1    1     1 1991-01-01
 2:  1    1     1 1991-04-01
 3:  1    1     1 1991-07-01
 4:  1    1     1 1991-10-01
 5:  1    1     2 1991-01-01
 6:  1    1     2 1991-04-01
 7:  1    1     2 1991-07-01
 8:  1    1     2 1991-10-01
 9:  1    2     3 1991-01-01
10:  1    2     3 1991-04-01
11:  1    2     3 1991-07-01
12:  1    2     3 1991-10-01
13:  1    2     4 1991-01-01
14:  2    1     1 1992-01-01
15:  2    1     1 1992-04-01
16:  2    1     1 1992-07-01
17:  2    1     1 1992-10-01
18:  2    1     2 1992-01-01
19:  2    1     2 1992-04-01
20:  2    1     2 1992-07-01
21:  2    1     2 1992-10-01
22:  2    2     3 1992-01-01
23:  2    2     5 1992-01-01
24:  3    1     1 1993-01-01
25:  3    1     2 1993-01-01
26:  3    2     6 1993-01-01
27:  3    2     7 1993-01-01
    id boro block        qtr
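
For comparison, the same result can be sketched with a per-row expansion that uses if/else instead of ifelse(). This is an illustrative alternative, not the answerer's method: expand_quarters is a hypothetical helper, the three-row dt is a stand-in for rows 1, 4 and 7 of the question's table, and the sketch assumes end_date is strictly later than date whenever it is not NA.

```r
library(data.table)

# Small stand-in for the question's table (rows 1, 4 and 7 only).
dt <- data.table(
  id       = c(1, 1, 2),
  boro     = c(1, 2, 2),
  block    = c(1, 4, 3),
  date     = as.Date(c("1991-01-01", "1991-01-01", "1992-01-01")),
  end_date = as.Date(c("1992-01-01", NA, NA))
)

# Hypothetical helper: quarter starts in [from, to), or just `from` when
# `to` is NA. Assumes to > from whenever `to` is present.
expand_quarters <- function(from, to) {
  if (is.na(to)) {
    out <- from
  } else {
    out <- seq(from, to, by = "quarter")
    out <- out[out < to]  # drop the end boundary, as in the expected output
  }
  # Force double storage so every group returns the same column type.
  as.Date(as.numeric(out), origin = "1970-01-01")
}

# One group per row; length-1 columns are recycled against qtr.
res <- dt[, .(id, boro, block, qtr = expand_quarters(date, end_date)),
          by = .(row = seq_len(nrow(dt)))]
res[, row := NULL]
print(res)
```

This version calls seq() once per row, so on a table with millions of rows it may well be slower than the non-equi join above; it is mainly useful for checking results on a small sample.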