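Finding the most recent past event in dplyr

Time: 2017-09-27 08:19:46

Tags: r dplyr tidyverse

I have a list of measurements that are taken semi-regularly: they are supposed to be taken at a fixed interval, but sometimes there are NAs and the measurements start over. In another list, I have information about events. For each measurement, I want to know the date of the most recent event in the past. How can I do this in R, preferably using dplyr?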

library(dplyr)
library(lubridate)

measurements <- tibble(timestamp = seq(ymd('2017-01-01'), 
                                       ymd('2017-01-20'), 
                                       by = "2 days"),
                       data = runif(10))

events <- tibble(timestamp = ymd('2017-01-04', '2017-01-12'), 
                 type = 'Start')

expected = ymd(NA, NA, '2017-01-04', '2017-01-04', 
               '2017-01-04', '2017-01-04', 
               '2017-01-12', '2017-01-12',
               '2017-01-12', '2017-01-12')

measurements %>% mutate(distance = expected)

# A tibble: 10 x 3
    timestamp       data   distance
       <date>      <dbl>     <date>
 1 2017-01-01 0.01037106         NA
 2 2017-01-03 0.50183512         NA
 3 2017-01-05 0.80695523 2017-01-04
 4 2017-01-07 0.98605880 2017-01-04
 5 2017-01-09 0.78591144 2017-01-04
 6 2017-01-11 0.02296494 2017-01-04
 7 2017-01-13 0.94335407 2017-01-12
 8 2017-01-15 0.10540759 2017-01-12
 9 2017-01-17 0.27344290 2017-01-12
10 2017-01-19 0.09080328 2017-01-12

1 answer:

Answer 0 (score: 2)

One option is to expand the data to a daily sequence and then left_join it with the other dataset

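library(tidyverse)
events %>% 
    transmute(timestamp, distance = timestamp) %>%
    # expand to one row per calendar day across the measurement range
    right_join(expand(measurements, timestamp = seq(first(timestamp),
                        last(timestamp), by = "day")), by = 'timestamp') %>%
    # sort by date before filling; the row order returned by right_join()
    # varies across dplyr versions
    arrange(timestamp) %>%
    fill(distance) %>% 
    left_join(measurements, ., by = 'timestamp')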
 # A tibble: 10 x 3
 #   timestamp      data   distance
 #      <date>     <dbl>     <date>
 #1 2017-01-01 0.6299731         NA
 #2 2017-01-03 0.1838285         NA
 #3 2017-01-05 0.8636441 2017-01-04
 #4 2017-01-07 0.7465680 2017-01-04
 #5 2017-01-09 0.6682846 2017-01-04
 #6 2017-01-11 0.6180179 2017-01-04
 #7 2017-01-13 0.3722381 2017-01-12
 #8 2017-01-15 0.5298357 2017-01-12
 #9 2017-01-17 0.8746823 2017-01-12
 #10 2017-01-19 0.5817501 2017-01-12
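
With dplyr 1.1.0 or later, a rolling join can probably express the same lookup without expanding to a daily grid. This is a minimal sketch and not part of the original answer: it assumes join_by() and closest() are available (dplyr >= 1.1.0), and events_lookup is an illustrative name. Note that >= counts an event on the same day as a measurement as already in the past; use > to exclude it.

```r
library(dplyr)
library(lubridate)

measurements <- tibble(timestamp = seq(ymd("2017-01-01"),
                                       ymd("2017-01-20"),
                                       by = "2 days"),
                       data = runif(10))

events <- tibble(timestamp = ymd(c("2017-01-04", "2017-01-12")),
                 type = "Start")

# Rename the event date so it survives the join as its own column
# (events_lookup is an illustrative name, not from the original post)
events_lookup <- events %>% transmute(distance = timestamp)

# For each measurement, keep only the closest event dated on or before it;
# measurements with no earlier event get NA
result <- measurements %>%
    left_join(events_lookup, by = join_by(closest(timestamp >= distance)))

result
```

Because the inequality join keeps the key columns from both tables, distance ends up next to timestamp and data, matching the layout of the expected output in the question.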

Another option is data.table, using a rolling join (the roll argument)

library(data.table)
library(zoo)
setDT(measurements)[as.data.table(events)[, distance := timestamp
    ], distance := distance , on = 'timestamp', roll = -Inf
     ][, distance := na.locf(distance, na.rm = FALSE)]
measurements
#     timestamp      data   distance
# 1: 2017-01-01 0.2387260       <NA>
# 2: 2017-01-03 0.9623589       <NA>
# 3: 2017-01-05 0.6013657 2017-01-04
# 4: 2017-01-07 0.5150297 2017-01-04
# 5: 2017-01-09 0.4025733 2017-01-04
# 6: 2017-01-11 0.8802465 2017-01-04
# 7: 2017-01-13 0.3640919 2017-01-12
# 8: 2017-01-15 0.2882393 2017-01-12
# 9: 2017-01-17 0.1706452 2017-01-12
#10: 2017-01-19 0.1721717 2017-01-12

Note: since no seed was set, the values of 'data' (generated with runif) will be different if we create the 'measurements' dataset again
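
To make the random 'data' column reproducible, a seed can be set before building 'measurements'. This is a small sketch: the seed value 42 is arbitrary, and make_measurements() is a hypothetical helper introduced here only to show that the same seed yields the same values.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: rebuilds the question's data with a fixed seed
make_measurements <- function(seed) {
    set.seed(seed)
    tibble(timestamp = seq(ymd("2017-01-01"),
                           ymd("2017-01-20"),
                           by = "2 days"),
           data = runif(10))
}

first_run  <- make_measurements(42)   # 42 is an arbitrary choice
second_run <- make_measurements(42)

# Same seed, same random draws
identical(first_run$data, second_run$data)
```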

Or, as @Henrik mentioned, if we don't want to add a column to the 'measurements' dataset, we can do

setDT(events)[setDT(measurements), .(timestamp, data, x.timestamp),
             on = "timestamp", roll = Inf]
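
The two data.table snippets use opposite roll directions because the tables swap roles. The sketch below illustrates this with toy copies of the timestamps (m and e are illustrative names, not from the original post), using which = TRUE to show the matched row numbers instead of joined columns.

```r
library(data.table)

# Toy copies of the question's timestamps (m and e are illustrative names)
m <- data.table(timestamp = as.Date("2017-01-01") + seq(0, 18, by = 2))
e <- data.table(timestamp = as.Date(c("2017-01-04", "2017-01-12")))

# Events looked up in measurements with roll = -Inf: each event matches the
# next measurement on or after it, so only those rows receive a value and
# the rows in between still need na.locf()
next_measurement <- m[e, on = "timestamp", roll = -Inf, which = TRUE]

# Measurements looked up in events with roll = Inf: each measurement matches
# the last event on or before it (NA when no event has happened yet)
last_event <- e[m, on = "timestamp", roll = Inf, which = TRUE]

next_measurement
last_event
```

The second form returns one match per measurement directly, which is why @Henrik's version does not need the extra fill step.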