I have the following two data frames:
df1 <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")))
df2 <- data.frame(ID = c("A","A","A","B","C","D","D","D","D","D","E"),
Date = as.POSIXct(c("2018-04-10 07:11:00","2018-04-11 18:59:00","2018-04-12 12:37:00","2018-04-15 01:43:00","2018-04-21 09:52:00","2018-04-15 20:25:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00","2018-04-20 14:11:00","2018-05-01 09:50:00")))
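For df1, I would like to do two things. First, I want to look up the closest preceding date in df2 by ID. Second, I want to look up the closest following date in df2 by ID, again without repeating values. In both cases, I do not want a date from df2 to be repeated in df1.
Using the roll = Inf feature of the data.table package, I can merge in the preceding date by ID.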
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, ID, Date)
setkey(df2, ID, Date)[, PrecedingDate:=Date]
result <- df2[df1, roll=Inf]
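I am not sure how to pull the closest following date from df2 into df1, or how to make sure that dates are not repeated.
The result should look like this: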
result <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")),
PrecedingDate = as.POSIXct(c("2018-04-11 18:59:00","2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-15 20:25:00","2018-04-17 14:21:00",NA,"2018-05-01 09:50:00")),
FollowingDate = as.POSIXct(c("2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-21 09:52:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00",NA)))
Any help here would be greatly appreciated.
Answer 0 (score: 1)
Here is a solution using dplyr. You may get some warnings from the min and max functions, but they can safely be ignored or suppressed.
library(dplyr)
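# Return the positions of the closest negative and the closest positive difference.
# With no candidate on one side, max()/min() warn and that side ends up as NA.
closest_to_zero <- function(x) {
  neg <- which(x == max(x[x < 0]))  # closest difference below zero
  pos <- which(x == min(x[x > 0]))  # closest difference above zero
  c(previous = neg, following = pos)
}
# Pair every df1 row with all df2 rows of the same ID, then pick per df1 row.
result <- left_join(df1, df2, by = "ID") %>%
  group_by(ID, Date.x) %>%
  mutate(
    time_diff = Date.y - Date.x,
    Preceding = Date.y[closest_to_zero(time_diff)["previous"]],
    Following = Date.y[closest_to_zero(time_diff)["following"]]
  ) %>%
  distinct(ID, Date.x, Preceding, Following)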
# A tibble: 9 x 4
# Groups: ID, Date.x [9]
ID Date.x Preceding Following
<fct> <dttm> <dttm> <dttm>
1 A 2018-04-12 08:56:00 2018-04-11 18:59:00 2018-04-12 12:37:00
2 A 2018-04-13 11:03:00 2018-04-12 12:37:00 NA
3 B 2018-04-14 14:30:00 NA 2018-04-15 01:43:00
4 B 2018-04-15 03:10:00 2018-04-15 01:43:00 NA
5 C 2018-04-16 07:28:00 NA 2018-04-21 09:52:00
6 D 2018-04-17 11:17:00 2018-04-15 20:25:00 2018-04-17 12:33:00
7 D 2018-04-17 14:21:00 2018-04-17 12:33:00 2018-04-18 10:59:00
8 D 2018-04-18 09:56:00 2018-04-17 14:21:00 2018-04-18 10:59:00
9 E 2018-05-02 07:49:00 2018-05-01 09:50:00 NA
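The warnings mentioned above come from max() and min() being called on an empty vector when a group has no earlier or no later date. As a minimal sketch, assuming that only the first match matters when several rows tie, the helper could check for candidates before searching. The name closest_to_zero_quiet is hypothetical and not part of the original answer.

```r
# Hypothetical warning-free variant of closest_to_zero (not from the original answer).
closest_to_zero_quiet <- function(x) {
  x <- as.numeric(x)  # drop difftime units so comparisons are plain numeric
  has_neg <- any(x < 0, na.rm = TRUE)
  has_pos <- any(x > 0, na.rm = TRUE)
  # Mask the wrong-signed values with -Inf / Inf so they can never win.
  neg <- if (has_neg) which.max(ifelse(x < 0, x, -Inf)) else NA_integer_
  pos <- if (has_pos) which.min(ifelse(x > 0, x, Inf)) else NA_integer_
  c(previous = neg, following = pos)
}

closest_to_zero_quiet(c(-5, -1, 3, 10))  # both sides present
closest_to_zero_quiet(c(2, 4))           # no negative difference
```

It should be usable in place of closest_to_zero in the mutate() call, although unlike the original it always returns exactly one index per side.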
Answer 1 (score: 0)
A possible solution using data.table:
df1[, PrecedingDate := df2[df1
, on = .(ID, Date <= Date)
, .(ID, Date = i.Date, pd = x.Date)
][, .SD[.N], by = .(ID, Date)
][shift(pd) == pd, pd := NA][, pd]
][, FollowingDate := df2[df1
, on = .(ID, Date >= Date)
, .(ID, Date = i.Date, fd = x.Date)
][, .SD[1], by = .(ID, Date)][, fd]][]
which gives:
> df1
   ID                Date       PrecedingDate       FollowingDate
1:  A 2018-04-12 08:56:00 2018-04-11 18:59:00 2018-04-12 12:37:00
2:  A 2018-04-13 11:03:00 2018-04-12 12:37:00                <NA>
3:  B 2018-04-14 14:30:00                <NA> 2018-04-15 01:43:00
4:  B 2018-04-15 03:10:00 2018-04-15 01:43:00                <NA>
5:  C 2018-04-16 07:28:00                <NA> 2018-04-21 09:52:00
6:  D 2018-04-17 11:17:00 2018-04-15 20:25:00 2018-04-17 12:33:00
7:  D 2018-04-17 14:21:00 2018-04-17 14:21:00 2018-04-17 14:21:00
8:  D 2018-04-18 09:56:00                <NA> 2018-04-18 10:59:00
9:  E 2018-05-02 07:49:00 2018-05-01 09:50:00                <NA>
This is equal to the desired result:
> all.equal(df1, as.data.table(result))
[1] TRUE
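The chained expression above for PrecedingDate may be easier to follow when split into named steps. The following is only a sketch on a small made-up subset; the tables x and y and the intermediate names cand and last are hypothetical and do not appear in the answer.

```r
library(data.table)

# Hypothetical two-row example for a single ID (not the full data from the question).
x <- data.table(ID = "D",
                Date = as.POSIXct(c("2018-04-17 14:21:00", "2018-04-18 09:56:00"), tz = "UTC"))
y <- data.table(ID = "D",
                Date = as.POSIXct(c("2018-04-15 20:25:00", "2018-04-17 14:21:00"), tz = "UTC"))

# Step 1: non-equi join, keeping every y date at or before each x date.
cand <- y[x, on = .(ID, Date <= Date), .(ID, Date = i.Date, pd = x.Date)]

# Step 2: per x row, keep the last (latest) candidate.
last <- cand[, .SD[.N], by = .(ID, Date)]

# Step 3: blank out a preceding date that the previous row already used.
last[shift(pd) == pd, pd := NA]

print(last)
```

Both rows of x initially match 2018-04-17 14:21:00, so step 3 leaves that date on the first row and sets the second to NA, which is the "no repeated dates" requirement from the question.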
Data used:
df1 <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")))
df2 <- data.frame(ID = c("A","A","A","B","C","D","D","D","D","D","E"),
Date = as.POSIXct(c("2018-04-10 07:11:00","2018-04-11 18:59:00","2018-04-12 12:37:00","2018-04-15 01:43:00","2018-04-21 09:52:00","2018-04-15 20:25:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00","2018-04-20 14:11:00","2018-05-01 09:50:00")))
result <- data.frame(ID = c("A","A","B","B","C","D","D","D","E"),
Date = as.POSIXct(c("2018-04-12 08:56:00","2018-04-13 11:03:00","2018-04-14 14:30:00","2018-04-15 03:10:00","2018-04-16 07:28:00","2018-04-17 11:17:00","2018-04-17 14:21:00","2018-04-18 09:56:00","2018-05-02 07:49:00")),
PrecedingDate = as.POSIXct(c("2018-04-11 18:59:00","2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-15 20:25:00","2018-04-17 14:21:00",NA,"2018-05-01 09:50:00")),
FollowingDate = as.POSIXct(c("2018-04-12 12:37:00",NA,"2018-04-15 01:43:00",NA,"2018-04-21 09:52:00","2018-04-17 12:33:00","2018-04-17 14:21:00","2018-04-18 10:59:00",NA)))
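As an aside, the roll argument mentioned in the question can also look in the other direction: roll = Inf carries the last earlier value forward, while roll = -Inf carries the next later value backward. The sketch below uses small made-up tables x and y and a helper column yDate, all of which are hypothetical. It only covers the lookup step and does not remove repeated dates.

```r
library(data.table)

# Hypothetical data for one ID (not the data from the question).
x <- data.table(ID = c("A", "A"),
                Date = as.POSIXct(c("2018-04-12 08:56:00", "2018-04-13 11:03:00"), tz = "UTC"))
y <- data.table(ID = c("A", "A", "A"),
                Date = as.POSIXct(c("2018-04-10 07:11:00", "2018-04-11 18:59:00",
                                    "2018-04-12 12:37:00"), tz = "UTC"))

# Copy the join column so the matched y date is still visible after the join.
y[, yDate := Date]

# roll = Inf: closest y date at or before each x date.
prev <- y[x, on = .(ID, Date), roll = Inf]
# roll = -Inf: closest y date at or after each x date.
nxt <- y[x, on = .(ID, Date), roll = -Inf]

# The join results have one row per x row, in x order.
x[, PrecedingDate := prev$yDate]
x[, FollowingDate := nxt$yDate]
print(x)
```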