Spread data over a time interval

Time: 2015-07-07 22:26:17

Tags: r date data.table

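I have a data.table with a start date and an end date in each row. I want to reshape it so that all of a row's information is repeated for every date in the interval between its start date and end date.

My data looks like this: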

       tripstart    tripend Country
1: 2014-10-07 2014-10-10      US
2: 2013-06-12 2013-06-13      FR
3: 2013-02-07 2013-02-10      DK

From this data, the result I am looking for looks like this:

Day          Country
2014-10-10   US
2014-10-09   US
2014-10-08   US
2014-10-07   US
2013-06-13   FR
2013-06-12   FR
2013-02-10   DK
2013-02-09   DK
2013-02-08   DK
2013-02-07   DK

I tried the following, without success:

setkey(hotel_stays, tripstart, tripend)

# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)

hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
setkey(hotel_stays_long, day)
foverlaps(hotel_stays, hotel_stays_long)

R code for the data:

hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK")) 

1 answer:

Answer 0 (score: 1)

Thanks to Frank, I have two solutions.

library(data.table)

hotel_stays <- data.table(tripstart = c(as.Date("2014-10-07"), as.Date("2013-06-12"), as.Date("2013-02-07")), tripend = c(as.Date("2014-10-10"), as.Date("2013-06-13"), as.Date("2013-02-10")), Country = c("US", "FR", "DK"))

### Solution 1
setkey(hotel_stays, tripstart, tripend)

# Earliest and latest dates across all trips.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)

# One row per calendar day. by = "day" keeps the grid on whole days and
# includes both endpoints; length.out = max_date - min_date would not.
hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date, by = "day"))
hotel_stays_long[, end := day]
setkey(hotel_stays_long, day, end)
hotel_stays_long <- foverlaps(hotel_stays, hotel_stays_long)
hotel_stays_long[, c("end", "tripstart", "tripend") := NULL]

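### Solution 2
# Start from the original table (it still has tripstart and tripend) and
# expand each row into one row per day of its trip.
hotel_stays[, .(day = seq(tripstart, tripend, by = "day"), Country),
            by = 1:nrow(hotel_stays)]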

I ran both examples on a private dataset that contains some additional columns. The dimensions of that dataset are:

> dim(hotel_stays)
[1] 4675   28

The first solution results in:

replications elapsed relative user.self sys.self user.child sys.child
1          100   1.898        1     1.889    0.005          0         0

The second solution results in:

 replications elapsed relative user.self sys.self user.child sys.child
1          100  45.244        1    45.253        0          0         0
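
The private dataset is not available, so the timings above cannot be reproduced exactly. The following is only a rough sketch of how a comparable benchmark might be set up with benchmark() from the rbenchmark package. The synthetic data (row count taken from the dim() output, trip lengths of 1 to 7 days, three columns instead of 28) and the helper names expand_overlap and expand_by_row are assumptions made for illustration, not part of the original answer.

```r
library(data.table)
library(rbenchmark)

# Synthetic stand-in for the private dataset (assumption: 4675 rows,
# trips lasting 1 to 7 days, three countries, no extra columns).
set.seed(1)
n <- 4675L
starts <- as.Date("2013-01-01") + sample(0:700, n, replace = TRUE)
stays <- data.table(
  tripstart = starts,
  tripend   = starts + sample(1:7, n, replace = TRUE),
  Country   = sample(c("US", "FR", "DK"), n, replace = TRUE)
)

# Hypothetical helper: overlap join against a daily grid (Solution 1).
expand_overlap <- function(dt) {
  x <- copy(dt)                      # avoid re-keying the caller's table
  setkey(x, tripstart, tripend)
  grid <- data.table(day = seq(min(x$tripstart), max(x$tripend), by = "day"))
  grid[, end := day]                 # each day is the interval [day, day]
  setkey(grid, day, end)
  foverlaps(x, grid)[, .(day, Country)]
}

# Hypothetical helper: one seq() call per row (Solution 2).
expand_by_row <- function(dt) {
  dt[, .(day = seq(tripstart, tripend, by = "day"), Country),
     by = .(row_id = seq_len(nrow(dt)))][, .(day, Country)]
}

# Sanity check: both approaches should return the same (day, Country) rows.
a <- expand_overlap(stays)
b <- expand_by_row(stays)
setorder(a, day, Country)
setorder(b, day, Country)
stopifnot(nrow(a) == nrow(b), all(a$day == b$day), all(a$Country == b$Country))

# Fewer replications than the 100 used above, to keep the run short.
timings <- benchmark(
  overlap = expand_overlap(stays),
  by_row  = expand_by_row(stays),
  replications = 10
)
print(timings)
```

Absolute numbers will differ from those reported here, since the data, hardware and package versions are different. One plausible reason for the gap is that the per-row version calls seq() once for every row, whereas the overlap join builds a single daily grid.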

Test environment:

> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rbenchmark_1.0.0 data.table_1.9.5 RODBC_1.3-11    

loaded via a namespace (and not attached):
[1] tools_3.2.0  chron_2.3-45

Conclusion: the first solution is faster but less elegant.
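
Given that trade-off, a third variant may be worth trying. It is not from the original answer and has not been timed on the private dataset. This minimal sketch repeats each row once per day of its trip and then adds a day offset, so seq() is never called per row. The names n_days and expanded are introduced here for illustration only.

```r
library(data.table)

hotel_stays <- data.table(
  tripstart = as.Date(c("2014-10-07", "2013-06-12", "2013-02-07")),
  tripend   = as.Date(c("2014-10-10", "2013-06-13", "2013-02-10")),
  Country   = c("US", "FR", "DK")
)

# Number of days covered by each trip, counting both endpoints.
n_days <- as.integer(hotel_stays$tripend - hotel_stays$tripstart) + 1L

# Repeat each row once per day of its trip ...
expanded <- hotel_stays[rep(seq_len(nrow(hotel_stays)), times = n_days)]

# ... then add a 0-based offset that restarts at every trip.
expanded[, day := tripstart + (sequence(n_days) - 1L)]
expanded <- expanded[, .(day, Country)]
print(expanded)
```

The rows come out in ascending date order within each trip; sort with setorder() if the descending order shown in the question is needed.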