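I have a list of measurements taken semi-regularly: they are supposed to be taken at a fixed interval, but sometimes there are NAs and the measurements start over. In another list I have information about events. For each measurement, I want to know the date of the most recent event before it. How can I do this in R, preferably with dplyr?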
library(dplyr)
library(lubridate)
measurements <- tibble(timestamp = seq(ymd('2017-01-01'),
                                       ymd('2017-01-20'),
                                       by = "2 days"),
                       data = runif(10))
events <- tibble(timestamp = ymd('2017-01-04', '2017-01-12'),
                 type = 'Start')
expected = ymd(NA, NA, '2017-01-04', '2017-01-04',
               '2017-01-04', '2017-01-04',
               '2017-01-12', '2017-01-12',
               '2017-01-12', '2017-01-12')
measurements %>% mutate(distance = expected)
# A tibble: 10 x 3
timestamp data distance
<date> <dbl> <date>
1 2017-01-01 0.01037106 NA
2 2017-01-03 0.50183512 NA
3 2017-01-05 0.80695523 2017-01-04
4 2017-01-07 0.98605880 2017-01-04
5 2017-01-09 0.78591144 2017-01-04
6 2017-01-11 0.02296494 2017-01-04
7 2017-01-13 0.94335407 2017-01-12
8 2017-01-15 0.10540759 2017-01-12
9 2017-01-17 0.27344290 2017-01-12
10 2017-01-19 0.09080328 2017-01-12
Answer 0 (score: 2)
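One option is to expand the measurement timestamps into a daily sequence, join the events onto it, fill the event date downward, and then left_join the result back onto the measurements.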
library(tidyverse)
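events %>%
  transmute(timestamp, distance = timestamp) %>%
  right_join(expand(measurements,
                    timestamp = seq(first(timestamp), last(timestamp), by = "day")),
             by = "timestamp") %>%
  # recent dplyr versions put unmatched rows last, so restore date order before filling
  arrange(timestamp) %>%
  fill(distance) %>%
  left_join(measurements, ., by = 'timestamp')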
# A tibble: 10 x 3
# timestamp data distance
# <date> <dbl> <date>
#1 2017-01-01 0.6299731 NA
#2 2017-01-03 0.1838285 NA
#3 2017-01-05 0.8636441 2017-01-04
#4 2017-01-07 0.7465680 2017-01-04
#5 2017-01-09 0.6682846 2017-01-04
#6 2017-01-11 0.6180179 2017-01-04
#7 2017-01-13 0.3722381 2017-01-12
#8 2017-01-15 0.5298357 2017-01-12
#9 2017-01-17 0.8746823 2017-01-12
#10 2017-01-19 0.5817501 2017-01-12
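If dplyr 1.1.0 or later is available, a closest-match join may give the same result without expanding to a daily sequence. This is a different technique from the expand/fill approach above, offered only as a minimal sketch: it assumes the events are renamed to a `distance` column first, and it rebuilds the question's data with base `as.Date()` so the block runs on its own.

```r
library(dplyr)  # join_by() and closest() require dplyr >= 1.1.0

# Same data as in the question, built without lubridate
measurements <- tibble(
  timestamp = seq(as.Date("2017-01-01"), as.Date("2017-01-20"), by = "2 days"),
  data = runif(10)
)
events <- tibble(
  timestamp = as.Date(c("2017-01-04", "2017-01-12")),
  type = "Start"
)

# For each measurement, keep only the latest event dated on or before it;
# measurements with no earlier event get NA
result <- measurements %>%
  left_join(
    transmute(events, distance = timestamp),
    by = join_by(closest(timestamp >= distance))
  )

result
```

Use `timestamp > distance` instead if an event on the same day as a measurement should not count as being in the past.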
Another option is data.table, specifying roll in the join
library(data.table)
library(zoo)
setDT(measurements)[as.data.table(events)[, distance := timestamp],
                    distance := distance, on = 'timestamp', roll = -Inf
                    ][, distance := na.locf(distance, na.rm = FALSE)]
measurements
# timestamp data distance
# 1: 2017-01-01 0.2387260 <NA>
# 2: 2017-01-03 0.9623589 <NA>
# 3: 2017-01-05 0.6013657 2017-01-04
# 4: 2017-01-07 0.5150297 2017-01-04
# 5: 2017-01-09 0.4025733 2017-01-04
# 6: 2017-01-11 0.8802465 2017-01-04
# 7: 2017-01-13 0.3640919 2017-01-12
# 8: 2017-01-15 0.2882393 2017-01-12
# 9: 2017-01-17 0.1706452 2017-01-12
#10: 2017-01-19 0.1721717 2017-01-12
Note: because no seed was set, the values of 'data' (generated with runif) will differ each time the 'measurements' dataset is recreated.
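To make the random column reproducible, a seed can be set before the data is generated. A minimal sketch, where the seed value 42 is an arbitrary choice for illustration:

```r
# Setting the same seed before each draw yields the same random numbers
set.seed(42)
first_draw <- runif(10)

set.seed(42)
second_draw <- runif(10)

identical(first_draw, second_draw)  # TRUE
```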
Alternatively, as @Henrik mentioned, if we don't want to modify the 'measurements' dataset, we can do
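# roll = Inf carries the last event at or before each measurement forward;
# x.timestamp is the matched event date
setDT(events)[setDT(measurements), .(timestamp, data, x.timestamp),
              on = "timestamp", roll = Inf]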
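To check what the rolling join returns, the "last event at or before each measurement" logic can be reproduced in base R with `findInterval()`. This is a sketch for illustration rather than part of the answers above: the names `measurement_dates`, `event_dates`, and `last_event` are made up here, and the event dates are assumed to be sorted in ascending order.

```r
measurement_dates <- seq(as.Date("2017-01-01"), as.Date("2017-01-20"), by = "2 days")
event_dates <- as.Date(c("2017-01-04", "2017-01-12"))  # must be sorted ascending

# For each measurement, count how many events fall on or before it;
# that count is also the index of the most recent event
idx <- findInterval(as.numeric(measurement_dates), as.numeric(event_dates))

# A count of 0 means no event has happened yet, so mark it as missing
idx[idx == 0] <- NA

last_event <- event_dates[idx]
data.frame(timestamp = measurement_dates, distance = last_event)
```

The first two measurements come before any event and so get NA, matching the expected column in the question.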