tidyverse: matching specific dates to event periods

Time: 2019-02-04 18:25:42

Tags: r vlookup tidyverse

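I have some dates that I want to match to events for which I only have start dates. As a simplified reprex, say I want to figure out who was president during certain events, but I only have the inauguration dates.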


Obviously a simple left_join won't work, because the events didn't happen on the inauguration dates themselves.

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date"))

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))

In Excel, VLOOKUP gives you the option of TRUE (match the closest preceding value) or FALSE (exact match). Is there something similar in the tidyverse?

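3 answers:

Answer 0: (score: 4)

Here's one way to get the desired result, though there may be neater ones. You can create intervals, a class provided by lubridate for specifying a timespan with a definite start and end. It comes with the %within% operator, which tests whether a date falls inside an interval. So we first create the intervals and convert the pres column to character so that we can index it properly. Then we use map_chr to go over the event dates with a function that says "check whether this date is in each of the intervals, get the index of the one it is actually in (with which), and return the corresponding president". Obviously this requires each date to fall in exactly one interval, otherwise it will fail. Note that lubridate intervals include both endpoints, so an event dated exactly on an inauguration day would fall in two intervals.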

library(tidyverse)
library(lubridate)

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush",
                            "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date"))

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion",
                               "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))

pres2 <- pres %>%
  mutate(
    presidency = interval(inaugdate, lead(inaugdate, default = today())),
    pres = as.character(pres)
  )
events %>%
  mutate(pres = map_chr(date, ~ pres2$pres[which(. %within% pres2$presidency)]))
#>                  event       date           pres
#> 1 Challenger explosion 1986-01-28  Ronald Reagan
#> 2  Chernobyl explosion 1986-04-26  Ronald Reagan
#> 3    Hurricane Katrina 2005-08-29 George W. Bush
#> 4                 9-11 2001-09-11 George W. Bush

Created on 2019-02-04 by the reprex package (v0.2.1)
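For comparison, the "closest preceding match" behaviour of VLOOKUP with TRUE can also be sketched in base R with findInterval(). This is only a sketch, not part of the answer above: it assumes pres is already sorted by inaugdate in ascending order, and it treats each term as starting on the inauguration date and ending just before the next one.

```r
# Base R sketch: approximate-match lookup with findInterval().
# Assumption: pres is sorted by inaugdate (ascending), as in the question.
pres <- data.frame(
  pres = c("Ronald Reagan", "George H. W. Bush", "Bill Clinton",
           "George W. Bush", "Barack Obama", "Donald Trump"),
  inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 17186),
                        class = "Date"),
  stringsAsFactors = FALSE
)
events <- data.frame(
  event = c("Challenger explosion", "Chernobyl explosion",
            "Hurricane Katrina", "9-11"),
  date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
  stringsAsFactors = FALSE
)

# For each event date, findInterval() returns the index of the last
# inaugdate that is <= that date, or 0 if the date precedes all of them.
idx <- findInterval(as.numeric(events$date), as.numeric(pres$inaugdate))
idx[idx == 0] <- NA  # no president on record before the first inauguration

events$pres <- pres$pres[idx]
events
```

Because each date maps to a single index, an event that falls exactly on an inauguration day is assigned to the incoming president rather than matching two rows.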

Answer 1: (score: 1)

Possibly not the most efficient, but we can use a non-equi join with sqldf:

library(sqldf)

sqldf('select a.event, a.date, b.pres
      from events a 
      left join pres b
      on a.date >= b.inaugdate
      group by a.event 
      having min(a.date - b.inaugdate)
      order by date, event')

Output:

                 event       date           pres
1 Challenger explosion 1986-01-28  Ronald Reagan
2  Chernobyl explosion 1986-04-26  Ronald Reagan
3                 9-11 2001-09-11 George W. Bush
4    Hurricane Katrina 2005-08-29 George W. Bush
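The idea behind the query (keep the rows where inaugdate is on or before the event date, then pick the closest one) can be sketched without SQL as well. This is a minimal base R illustration, not a translation of what sqldf does internally; the row shuffle is only there to show that this formulation does not rely on pres being sorted.

```r
# Base R sketch of "latest inauguration on or before the event date".
pres <- data.frame(
  pres = c("Ronald Reagan", "George H. W. Bush", "Bill Clinton",
           "George W. Bush", "Barack Obama", "Donald Trump"),
  inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 17186),
                        class = "Date"),
  stringsAsFactors = FALSE
)
events <- data.frame(
  event = c("Challenger explosion", "Chernobyl explosion",
            "Hurricane Katrina", "9-11"),
  date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
  stringsAsFactors = FALSE
)

# Shuffle the rows (illustration only): the lookup below does not need order.
pres <- pres[c(3, 1, 6, 2, 5, 4), ]

closest_pres <- vapply(seq_along(events$date), function(i) {
  # Candidates: presidents inaugurated on or before the event date
  ok <- pres$inaugdate <= events$date[i]
  if (!any(ok)) return(NA_character_)
  cand <- pres[ok, ]
  # Among the candidates, keep the most recent inauguration
  cand$pres[which.max(as.numeric(cand$inaugdate))]
}, character(1))

data.frame(events, pres = closest_pres)
```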

Answer 2: (score: 0)

Perhaps not very efficient (depending on the number of rows and columns), but another way to solve the problem.

library(dplyr) 

pres <- data.frame(pres = c("Ronald Reagan", "George H. W. Bush", 
                            "Bill Clinton", "George W. Bush", "Barack Obama", "Donald Trump"), 
                   inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 
                                           17186), class = "Date")) %>% 
                  # lead() gives the next inauguration date, used as the exclusive end of each term
                  mutate(enddt = lead(inaugdate, default = Sys.Date()))

events <- data.frame(event = c("Challenger explosion", "Chernobyl explosion", "Hurricane Katrina", "9-11"), 
                     date = structure(c(5871, 5959, 13024, 11576), class = "Date"))          
#get every combination of rows
newdf <- merge(pres,events,all = TRUE) %>% 
  filter(date >= inaugdate, date < enddt)
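The answer does not show its output, so here is a small base R sketch of the same cross-join-then-filter idea with the interval boundaries spelled out. Each term is treated as half-open, from the inauguration date up to but not including the next one. The extra "hypothetical boundary event" row is invented purely to exercise the last day of a term and is not part of the original data.

```r
# Base R sketch: cross join every president with every event, then keep
# the pairs where the event date falls inside the president's term.
pres <- data.frame(
  pres = c("Ronald Reagan", "George H. W. Bush", "Bill Clinton",
           "George W. Bush", "Barack Obama", "Donald Trump"),
  inaugdate = structure(c(4037, 6959, 8420, 11342, 14264, 17186),
                        class = "Date"),
  stringsAsFactors = FALSE
)
# Exclusive end of each term: the next inauguration (today for the last row)
pres$enddt <- c(pres$inaugdate[-1], Sys.Date())

events <- data.frame(
  event = c("Challenger explosion", "Chernobyl explosion",
            "Hurricane Katrina", "9-11"),
  date = structure(c(5871, 5959, 13024, 11576), class = "Date"),
  stringsAsFactors = FALSE
)
# Hypothetical row (not in the original data): the day before the second
# inauguration, to check that the last day of a term is still matched.
events <- rbind(events,
                data.frame(event = "hypothetical boundary event",
                           date = pres$inaugdate[2] - 1,
                           stringsAsFactors = FALSE))

# by = NULL makes merge() return the Cartesian product of the two frames
all_pairs <- merge(pres, events, by = NULL)

# Half-open interval test: inaugdate <= date < enddt
keep <- all_pairs$date >= all_pairs$inaugdate & all_pairs$date < all_pairs$enddt
matched <- all_pairs[keep, c("event", "date", "pres")]
matched <- matched[order(matched$date), ]
matched
```

The cross join grows as the product of the two row counts, which is why this approach is only reasonable for small tables.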