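I have some dates that I want to match against events for which I only have a start date. As a simplified example, suppose I want to work out who was president during certain events, but I only have the inauguration dates.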
Clearly, a simple left_join won't work, because the events did not happen on an inauguration day.
pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date"))
events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"),
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))
In Excel, VLOOKUP gives you the option of TRUE (match the closest preceding value) or FALSE (exact match). Is there something similar in the tidyverse?
Answer 0 (score: 4)
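Here is one way to get the desired result, though there may well be a neater one. You can create intervals, a class provided by lubridate for specifying a time span with a definite start and end. The %within% operator that comes with it checks whether a date falls inside an interval. So we first create the interval and convert the pres column to character so that we can index into it properly. Then we use map_chr to iterate over the event dates with a function that says "check whether this date falls in each interval, get the index of the interval it actually falls in (using which), and return the president corresponding to it". Obviously, this requires each date to fall in exactly one interval; otherwise it fails. As far as I know, lubridate intervals include both endpoints, so an event dated exactly on an inauguration day would match two intervals and trigger that failure.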
library(tidyverse)
library(lubridate)
pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date"))
events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"),
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))
pres2 <- pres %>%
  mutate(
    presidency = interval(inaugdate, lead(inaugdate, default = today())),
    pres = as.character(pres)
  )
events %>%
  mutate(pres = map_chr(date, ~ pres2$pres[which(. %within% pres2$presidency)]))
#>                  event       date           pres
#> 1 Challenger explosion 1986-01-28  Ronald Reagan
#> 2  Chernobyl explosion 1986-04-26  Ronald Reagan
#> 3    Hurricane Katrina 2005-08-29 George W. Bush
#> 4                 9-11 2001-09-11 George W. Bush
Created on 2019-02-04 by the reprex package (v0.2.1)
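For comparison, the "closest preceding value" lookup that VLOOKUP performs can also be sketched in base R with findInterval(). This is an alternative technique, not the interval approach above. It assumes the inauguration dates can be sorted in increasing order and that an event on an inauguration day should be assigned to the incoming president; the idx variable is introduced here only for illustration.

```r
pres <- data.frame(
  pres = c("Ronald Reagan", "George H. W. Bush", "Bill Clinton",
           "George W. Bush", "Barack Obama", "Donald Trump"),
  inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 17186),
                        class = "Date"),
  stringsAsFactors = FALSE
)
events <- data.frame(
  event = c("Challenger explosion", "Chernobyl explosion",
            "Hurricane Katrina", "9-11"),
  date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
  stringsAsFactors = FALSE
)

# findInterval() requires its breakpoints in increasing order.
pres <- pres[order(pres$inaugdate), ]

# For each event date, the index of the last inauguration on or before it;
# 0 means the event predates every inauguration, so treat that as missing.
idx <- findInterval(as.numeric(events$date), as.numeric(pres$inaugdate))
idx[idx == 0] <- NA

events$pres <- pres$pres[idx]
events
```

Because each date maps to at most one index, this avoids the failure case where a date falls in more than one interval.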
Answer 1 (score: 1)
Probably not the most efficient, but we can use a non-equi join with sqldf:
library(sqldf)
sqldf('select a.event, a.date, b.pres
       from events a
       left join pres b
       on a.date >= b.inaugdate
       group by a.event
       having min(a.date - b.inaugdate)
       order by date, event')
Output:
                 event       date           pres
1 Challenger explosion 1986-01-28  Ronald Reagan
2  Chernobyl explosion 1986-04-26  Ronald Reagan
3                 9-11 2001-09-11 George W. Bush
4    Hurricane Katrina 2005-08-29 George W. Bush
Answer 2 (score: 0)
It may not be efficient (depending on the number of rows and columns), but it is another way to solve the problem.
library(dplyr)
pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush",
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"),
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264,
                                           17186), class = "Date")) %>%
  # lead() the inauguration date to get the last full day of each term
  mutate(enddt = lead(inaugdate, default = Sys.Date()) - 1)
events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"),
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))
# get every combination of rows, then keep those where the event falls in the term
# (enddt is already the day before the next inauguration, so compare with <=)
newdf <- merge(pres, events, all = TRUE) %>%
  filter(date >= inaugdate, date <= enddt)
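As a possible way to avoid building every combination of rows, the same match can be sketched as a rolling join. This assumes dplyr 1.1.0 or later, which provides join_by() and closest(); on older versions the code will not run. The result name is introduced here only for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0 for join_by() and closest()

pres <- data.frame(
  pres = c("Ronald Reagan", "George H. W. Bush", "Bill Clinton",
           "George W. Bush", "Barack Obama", "Donald Trump"),
  inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 17186),
                        class = "Date"),
  stringsAsFactors = FALSE
)
events <- data.frame(
  event = c("Challenger explosion", "Chernobyl explosion",
            "Hurricane Katrina", "9-11"),
  date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
  stringsAsFactors = FALSE
)

# For each event, keep only the closest inauguration on or before its date.
result <- left_join(events, pres, by = join_by(closest(date >= inaugdate)))
result
```

An event that predates every inauguration would keep its row with NA in the pres column, since this is a left join.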