Find the closest date between dataset1 and dataset2

Time: 2017-02-12 03:54:53

Tags: r date lubridate

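I have two datasets. One is collected roughly every 5 days; the other is collected every 15 minutes, every day. I want a final list that matches each entry in the more frequent dataset to the closest date in the less frequent dataset.

For example: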

satDat <- c('2015-04-16', '2015-04-21', '2015-04-26') # collected every 5 days

stationDat <- sort(rep(seq(as.Date("2015-04-01"), as.Date("2015-04-20"), by='day'),2)) 
#collected multiple times a day

 [1] "2015-04-01" "2015-04-01" "2015-04-02" "2015-04-02" "2015-04-03"
 [6] "2015-04-03" "2015-04-04" "2015-04-04" "2015-04-05" "2015-04-05"
[11] "2015-04-06" "2015-04-06" "2015-04-07" "2015-04-07" "2015-04-08"
[16] "2015-04-08" "2015-04-09" "2015-04-09" "2015-04-10" "2015-04-10"
[21] "2015-04-11" "2015-04-11" "2015-04-12" "2015-04-12" "2015-04-13"
[26] "2015-04-13" "2015-04-14" "2015-04-14" "2015-04-15" "2015-04-15"
[31] "2015-04-16" "2015-04-16" "2015-04-17" "2015-04-17" "2015-04-18"
[36] "2015-04-18" "2015-04-19" "2015-04-19" "2015-04-20" "2015-04-20"

I would like my result to look like this:

 [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
 [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
[36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

2 answers:

Answer 0 (score: 0)

A rolling join, as provided by the data.table package, comes to mind.

library(data.table)
DT1 <- data.table(date = as.Date(satDat), date1 = as.Date(satDat))
DT2 <- data.table(date = stationDat)

DT1[DT2, date1, roll = "nearest", on = .(date)]
# [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
# [7] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[13] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[19] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[25] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#[37] "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

It may also be useful for your actual task, because I suspect that task goes beyond this example.
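As a rough illustration of how the same rolling join might carry over to 15-minute timestamps, here is a minimal sketch. The tables `sat` and `station`, the column `sat_time`, and all timestamps are hypothetical and do not come from the question; the sketch only assumes that both tables share a POSIXct column named `time`.

```r
library(data.table)

# Hypothetical satellite passes (not from the question)
sat <- data.table(
  time = as.POSIXct(c("2015-04-16 10:30:00", "2015-04-21 10:40:00"), tz = "UTC")
)
# Keep a copy of the satellite time: after the join, the `time` column
# holds the station timestamps, so the copy carries the matched pass.
sat[, sat_time := time]

# Hypothetical station readings every 15 minutes over two days (192 rows)
station <- data.table(
  time = seq(as.POSIXct("2015-04-18 00:00:00", tz = "UTC"),
             by = "15 min", length.out = 192)
)

# For each station reading, roll to the nearest satellite pass
matched <- sat[station, on = .(time), roll = "nearest"]
head(matched)
```

Each row of `matched` then pairs a station timestamp (`time`) with the satellite pass closest to it (`sat_time`).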

Answer 1 (score: 0)

An option using outer:

satDat[apply(abs(outer(satDat, stationDat, difftime, units = 'days')), 2, which.min)]

#>  [1] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#>  [6] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [11] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [16] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [21] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [26] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [31] "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16" "2015-04-16"
#> [36] "2015-04-16" "2015-04-21" "2015-04-21" "2015-04-21" "2015-04-21"

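How it works:

  • outer applies difftime to every pair of elements from the two vectors and returns a matrix of differences, which abs makes non-negative,
  • apply iterates over the columns (MARGIN = 2), calling which.min on each one to get the index of the smallest difference,
  • and those indices are used to subset satDat.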
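The three steps can be traced on a tiny input. The sketch below uses hypothetical vectors `sat` and `station` (shortened so the matrix stays readable) and, as a simplification, subtracts the numeric day counts with "-" instead of calling difftime.

```r
# Hypothetical, shortened inputs (not the question's full data)
sat     <- as.Date(c("2015-04-16", "2015-04-21"))
station <- as.Date(c("2015-04-15", "2015-04-18", "2015-04-19"))

# Step 1: matrix of absolute differences in days;
# rows correspond to sat, columns to station
d <- abs(outer(as.numeric(sat), as.numeric(station), "-"))
d
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    6    3    2

# Step 2: for each column, the row index of the smallest difference
idx <- apply(d, 2, which.min)
idx
# [1] 1 1 2

# Step 3: use those indices to subset sat
nearest <- sat[idx]
nearest
# [1] "2015-04-16" "2015-04-16" "2015-04-21"
```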

Note that outer allocates a matrix of size length(satDat) × length(stationDat), which can require a lot of memory if your data is already large.
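If that matrix becomes too large, one possible workaround (not part of this answer, just a sketch in base R) is to sort the less frequent dates and look up each station date with findInterval against the midpoints between consecutive dates. The helper name `nearest_sorted` is hypothetical, and the third satellite date is assumed to be 2015-04-26.

```r
# Hypothetical helper: for each target, return the nearest candidate
# without building a length(candidates) x length(targets) matrix.
nearest_sorted <- function(targets, candidates) {
  cand <- sort(unique(candidates))
  if (length(cand) == 1L) return(rep(cand, length(targets)))
  cn <- as.numeric(cand)
  # Midpoints between consecutive candidates act as decision boundaries
  mids <- (cn[-length(cn)] + cn[-1]) / 2
  # left.open = TRUE sends an exact tie to the earlier candidate,
  # which mirrors which.min picking the first minimum
  idx <- findInterval(as.numeric(targets), mids, left.open = TRUE) + 1L
  cand[idx]
}

# Satellite dates (third date assumed to be 2015-04-26)
satDat <- as.Date(c("2015-04-16", "2015-04-21", "2015-04-26"))
stationDat <- sort(rep(seq(as.Date("2015-04-01"), as.Date("2015-04-20"), by = "day"), 2))

res <- nearest_sorted(stationDat, satDat)
table(res)
```

Memory use here grows with the lengths of the two inputs rather than with their product.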