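Selecting rows by date within intervals in data.table

Time: 2018-06-26 16:37:13

Tags: r data.table posix overlap

I want to select the observations in one data table that fall within time intervals specified in a second data table. Each interval is a period during which observations were being made from two platforms simultaneously.

The first data table looks like this. It is a set of animal sightings.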

library(data.table)
library(lubridate)  # provides interval() and the intersect() method used below

obs = data.table(sighting = as.POSIXct(c("2018-08-12 16:30:00", "2018-08-12 16:35:00", "2018-08-12 16:38:00", "2107-08-13 15:13:00", "2107-08-13 16:13:00", "2017-08-14 11:12:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), encounter = c("1", "1", "1", "2", "3", "4"), what = c("frog", "frog", "toad", "bird", "goat","bird"))

Observations are made from two platforms.

platformA = data.table(station = "A", on.effort = as.POSIXct(c("2018-08-12 16:00:00", "2018-08-12 17:35:00","2017-08-14 11:00:13", "2018-08-15 17:35:00"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), off.effort = as.POSIXct(c("2018-08-12 16:36:00", "2018-08-12 18:35:00","2017-08-14 12:12:13", "2018-08-15 18:35:00"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

platformB = data.table(station = "B", on.effort = as.POSIXct(c("2018-08-12 16:15:00", "2018-08-12 17:40:00", "2018-08-13 17:40:00","2017-08-14 11:05:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"), off.effort = as.POSIXct(c("2018-08-12 16:40:00", "2018-08-13 17:45:00", "2018-08-12 18:20:00","2017-08-14 12:30:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

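I first compute the intervals for each platform and then intersect them to find out when observations were being made from both at the same time.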

setkey(platformA, on.effort, off.effort)
setkey(platformB, on.effort, off.effort)

common = foverlaps(platformA, platformB,type="any",nomatch=0)

common$x = intersect(interval(common$on.effort, common$off.effort), 
                     interval(common$i.on.effort, common$i.off.effort))

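I would like to end up with a table that is a subset of "obs" and contains only the rows covered by the intervals in common$x. I had hoped to use foverlaps to find the rows that fall within the intersected intervals, and created "point" intervals for my sightings with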
obs[, sighting2 := sighting]

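But foverlaps wants the "start" and "end" of each interval in separate columns, which is not how the intervals are stored in common$x.

I would like my output to look like this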

           sighting encounter what
2018-08-12 16:30:00         1 frog
2018-08-12 16:35:00         1 frog
2017-08-14 11:12:13         4 bird

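I would appreciate any tips. Perhaps I could have been more efficient from the start? Thanks.

2 answers:

Answer 0: (score: 1)

I think this should work even if you have different observations between the platforms. Using the obs, platformA, and platformB data as above, build the intervals common to both platforms, more or less as you did for common: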

common = intersect(interval(platformA$on.effort, platformA$off.effort), 
                   interval(platformB$on.effort, platformB$off.effort))

You should then be able to use %within% to check whether each sighting falls inside any of the common intervals:

obs$both.seen <- sapply(obs$sighting, function(s){
  any(s %within% common)
})

or, in data.table syntax:

obs[, both.seen := sapply(sighting, function(x) any(x %within% common))]

The new obs:

> obs
              sighting encounter what both.seen
1: 2018-08-12 16:30:00         1 frog      TRUE
2: 2018-08-12 16:35:00         1 frog      TRUE
3: 2018-08-12 16:38:00         1 toad     FALSE
4: 2107-08-13 15:13:00         2 bird     FALSE
5: 2107-08-13 16:13:00         3 goat     FALSE
6: 2017-08-14 11:12:13         4 bird      TRUE

Subset to get the desired output:

obs <- obs[both.seen == 1][, both.seen := NULL][]

> obs
              sighting encounter what
1: 2018-08-12 16:30:00         1 frog
2: 2018-08-12 16:35:00         1 frog
3: 2017-08-14 11:12:13         4 bird
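
To illustrate why the check above wraps %within% in any(), here is a minimal sketch on hypothetical toy data (the names starts, ends, s, and per_interval are made up for this example, and the times are set in UTC for simplicity). Comparing a single time against a vector of intervals yields one logical value per interval, so any() is what collapses them into a single answer per sighting.

```r
library(lubridate)

# Hypothetical common intervals, built from vectors of start and end times
starts <- ymd_hms(c("2018-08-12 16:15:00", "2018-08-12 17:40:00"), tz = "UTC")
ends   <- ymd_hms(c("2018-08-12 16:36:00", "2018-08-12 18:20:00"), tz = "UTC")
common <- interval(starts, ends)

# A single sighting time
s <- ymd_hms("2018-08-12 16:30:00", tz = "UTC")

# One logical value per interval: is s inside interval i?
per_interval <- s %within% common

# Collapse to a single answer: is s inside at least one interval?
seen <- any(per_interval)
```

Without any(), assigning the result for one sighting into a single cell would fail or recycle, because the comparison returns as many values as there are intervals.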

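Answer 1: (score: 0)

I believe this gives you what you want. It does not take advantage of data.table functions and relies on base R instead. I am not sure whether that will cause performance problems with your data, but perhaps it offers a way to think about a more data.table-esque function.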

library(data.table)

# Set up the data
obs = data.table(sighting = as.POSIXct(c("2018-08-12 16:30:00", "2018-08-12 16:35:00", "2018-08-12 16:38:00", "2107-08-13 15:13:00", "2107-08-13 16:13:00", "2017-08-14 11:12:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"),
                 encounter = c("1", "1", "1", "2", "3", "4"),
                 what = c("frog", "frog", "toad", "bird", "goat","bird"))

platformA = data.table(station = "A",
                       on.effort = as.POSIXct(c("2018-08-12 16:00:00", "2018-08-12 17:35:00", "2017-08-14 11:00:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"),
                       off.effort = as.POSIXct(c("2018-08-12 16:36:00", "2018-08-12 18:35:00", "2017-08-14 12:12:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

platformB = data.table(station = "B",
                       on.effort = as.POSIXct(c("2018-08-12 16:15:00", "2018-08-12 17:40:00", "2017-08-14 11:05:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"),
                       off.effort = as.POSIXct(c("2018-08-12 16:40:00", "2018-08-12 18:20:00", "2017-08-14 12:30:13"), format = "%Y-%m-%d %H:%M:%OS", tz = "America/Halifax"))

# Get the start and end times for each observation (note use of pmax and pmin)
starts = pmax(platformA$on.effort, platformB$on.effort)
ends = pmin(platformA$off.effort, platformB$off.effort)

# For each sighting in obs check if it falls in between any of the intervals
seen = sapply(obs$sighting, function(x) {
  any(x >= starts & x <= ends)
})

# Subset the data
obs[seen, ]

              sighting encounter what
1: 2018-08-12 16:30:00         1 frog
2: 2018-08-12 16:35:00         1 frog
3: 2017-08-14 11:12:13         4 bird

The main aspect of this solution is the assignment of starts and ends. Since we are looking for the intersection of the observation periods of the two platforms, each start time is the later of the two platforms' start times (i.e., the maximum), and each end time is the earlier of the two platforms' end times (i.e., the minimum). By using pmin and pmax we get the element-wise minimum and maximum, respectively, which gives us vectors of times. In the comparison x >= starts & x <= ends, the single time x is compared element-wise against each pair of times starts[i] and ends[i], which gives us the intervals to compare against.
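
As one possible way to make this more data.table-esque, the same pmax/pmin idea can be combined with foverlaps, which is what the question set out to use. The following is only a sketch on hypothetical toy data: the names a_on, a_off, b_on, b_off, common_start, and common_end are invented for this example, the times are in UTC for simplicity, and the platform rows are assumed to be paired up by position. Pairs that do not overlap at all produce a start later than the end, so they are dropped before calling foverlaps, which requires start <= end.

```r
library(data.table)

# Hypothetical toy data (UTC), simplified from the example above
obs <- data.table(
  sighting = as.POSIXct(c("2018-08-12 16:30:00", "2018-08-12 16:38:00"), tz = "UTC"),
  what = c("frog", "toad")
)
a_on  <- as.POSIXct(c("2018-08-12 16:00:00", "2018-08-12 17:00:00"), tz = "UTC")
a_off <- as.POSIXct(c("2018-08-12 16:36:00", "2018-08-12 17:10:00"), tz = "UTC")
b_on  <- as.POSIXct(c("2018-08-12 16:15:00", "2018-08-12 17:20:00"), tz = "UTC")
b_off <- as.POSIXct(c("2018-08-12 16:40:00", "2018-08-12 17:30:00"), tz = "UTC")

# Element-wise intersection: latest start and earliest end of each pair
common <- data.table(common_start = pmax(a_on, b_on),
                     common_end   = pmin(a_off, b_off))

# Drop pairs with no overlap (their start falls after their end)
common <- common[common_start <= common_end]
setkey(common, common_start, common_end)

# Turn each sighting into a zero-length "point" interval
obs[, sighting2 := sighting]

# Keep sightings whose point interval lies within a common interval
hits <- foverlaps(obs, common,
                  by.x = c("sighting", "sighting2"),
                  type = "within", nomatch = 0L)
result <- hits[, .(sighting, what)]
```

With this toy data the second platform pair never overlaps and is filtered out, and only the 16:30 sighting falls inside the remaining common interval. Whether this is faster than the sapply version on real data has not been measured here.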