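I have a data.table with start and end dates that I want to reshape so that each row's information is repeated for every date in the interval between its start date and end date.
My data looks like this: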
tripstart tripend Country
1: 2014-10-07 2014-10-10 US
2: 2013-06-12 2013-06-13 FR
3: 2013-02-07 2013-02-10 DK
From this data, the result I am looking for looks like this:
Day Country
2014-10-10 US
2014-10-09 US
2014-10-08 US
2014-10-07 US
2013-06-13 FR
2013-06-12 FR
2013-02-10 DK
2013-02-09 DK
2013-02-08 DK
2013-02-07 DK
I tried the following, without success:
setkey(hotel_stays, tripstart, tripend)
# Get the first date that was used as transaction date.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
hotel_stays_long <- data.table(day = seq.Date(min_date, to = max_date,, length.out = max_date - min_date))
setkey(hotel_stays_long, day)
foverlaps(hotel_stays, hotel_stays_long)
R code for the data:
hotel_stays <- data.table(
  tripstart = as.Date(c("2014-10-07", "2013-06-12", "2013-02-07")),
  tripend   = as.Date(c("2014-10-10", "2013-06-13", "2013-02-10")),
  Country   = c("US", "FR", "DK")
)
Answer 0 (score: 1)
Thanks to Frank, I have two solutions.
library(data.table)
hotel_stays <- data.table(
  tripstart = as.Date(c("2014-10-07", "2013-06-12", "2013-02-07")),
  tripend   = as.Date(c("2014-10-10", "2013-06-13", "2013-02-10")),
  Country   = c("US", "FR", "DK")
)
## Solution 1
setkey(hotel_stays, tripstart, tripend)
# Earliest and latest dates covered by any trip.
max_date <- max(hotel_stays$tripend, hotel_stays$tripstart)
min_date <- min(hotel_stays$tripend, hotel_stays$tripstart)
# One row per calendar day; by = "day" keeps every value a whole date
# (length.out = max_date - min_date is one element short and gives fractional steps).
hotel_stays_long <- data.table(day = seq(min_date, max_date, by = "day"))
# foverlaps() needs an interval in y, so each day becomes the interval [day, day].
hotel_stays_long[, end := day]
setkey(hotel_stays_long, day, end)
hotel_stays_long <- foverlaps(hotel_stays, hotel_stays_long)
hotel_stays_long[, c("end", "tripstart", "tripend") := NULL]
## Solution 2
# Run this on the original hotel_stays table, which still has tripstart and tripend.
hotel_stays[, .(day = seq(tripstart, tripend, by = "day"), Country),
            by = 1:nrow(hotel_stays)]
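For anyone who wants to check that the two approaches agree, here is a minimal sketch on the sample data from the question. The names `calendar`, `via_overlap` and `via_seq` are only illustrative, and I pass `by.x` explicitly instead of keying `hotel_stays`; treat it as one possible way to write this, not the definitive one.

```r
library(data.table)

# Sample data from the question.
hotel_stays <- data.table(
  tripstart = as.Date(c("2014-10-07", "2013-06-12", "2013-02-07")),
  tripend   = as.Date(c("2014-10-10", "2013-06-13", "2013-02-10")),
  Country   = c("US", "FR", "DK")
)

# Approach 1: overlap join against a calendar of single-day intervals.
calendar <- data.table(day = seq(min(hotel_stays$tripstart),
                                 max(hotel_stays$tripend), by = "day"))
calendar[, end := day]          # each day is the interval [day, day]
setkey(calendar, day, end)      # foverlaps() requires y to be keyed
via_overlap <- foverlaps(hotel_stays, calendar,
                         by.x = c("tripstart", "tripend"))[, .(day, Country)]

# Approach 2: expand every row into its own sequence of days.
via_seq <- hotel_stays[, .(day = seq(tripstart, tripend, by = "day"), Country),
                       by = seq_len(nrow(hotel_stays))][, .(day, Country)]

# Sort both results the same way so they can be compared row by row.
setorder(via_overlap, day)
setorder(via_seq, day)
same_rows <- nrow(via_overlap) == nrow(via_seq) &&
  all(via_overlap$day == via_seq$day) &&
  all(via_overlap$Country == via_seq$Country)
print(same_rows)
```

The trips in the sample data do not overlap each other, so sorting by `day` alone is enough to line the rows up; with overlapping trips you would want to sort by `Country` as well.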
I ran both examples on a private dataset that contains some additional columns. The dimensions of that dataset are:
> dim(hotel_stays)
[1] 4675 28
The first solution resulted in:
  replications elapsed relative user.self sys.self user.child sys.child
1          100   1.898        1     1.889    0.005          0         0
The second solution resulted in:
  replications elapsed relative user.self sys.self user.child sys.child
1          100  45.244        1    45.253        0          0         0
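Since the private dataset cannot be shared, here is a rough sketch of how a comparable timing could be set up with `rbenchmark` on synthetic data. The table `stays`, its size and date ranges, and the helper names `expand_overlap` and `expand_per_row` are all assumptions made for illustration, so the numbers it prints will not match the ones above.

```r
library(data.table)
library(rbenchmark)

# Synthetic stand-in for the private dataset (assumed shape, not the real data).
set.seed(1)
n <- 1000L
starts <- as.Date("2013-01-01") + sample(0:700, n, replace = TRUE)
stays <- data.table(
  tripstart = starts,
  tripend   = starts + sample(1:5, n, replace = TRUE),
  Country   = sample(c("US", "FR", "DK"), n, replace = TRUE)
)

# Solution 1 wrapped in a function: overlap join against a daily calendar.
expand_overlap <- function(dt) {
  calendar <- data.table(day = seq(min(dt$tripstart), max(dt$tripend), by = "day"))
  calendar[, end := day]
  setkey(calendar, day, end)
  foverlaps(dt, calendar, by.x = c("tripstart", "tripend"))[, .(day, Country)]
}

# Solution 2 wrapped in a function: one seq() call per row.
expand_per_row <- function(dt) {
  dt[, .(day = seq(tripstart, tripend, by = "day"), Country),
     by = seq_len(nrow(dt))][, .(day, Country)]
}

# Time both approaches; raise replications for more stable numbers.
timings <- benchmark(
  overlap = expand_overlap(stays),
  per_row = expand_per_row(stays),
  replications = 5,
  columns = c("test", "replications", "elapsed", "relative")
)
print(timings)
```

Benchmarking both expressions in a single `benchmark()` call also makes the `relative` column meaningful, because it is computed against the fastest expression in that call.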
Test environment:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rbenchmark_1.0.0 data.table_1.9.5 RODBC_1.3-11
loaded via a namespace (and not attached):
[1] tools_3.2.0 chron_2.3-45
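Conclusion: the first solution is about 24 times faster on this dataset (1.898 s versus 45.244 s for 100 replications), but it is less elegant.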