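I am trying to add information from my second dataset to my first one, based on ID and date. If the ID matches and Date lies between start and end, I want to add the value of colour to df1.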
df1
ID Date
1  3/31/2017
2  2/11/2016
2  4/10/2016
3  5/15/2015
df2
ID start     end       colour
1  1/1/2000  3/31/2011 blue
1  4/1/2011  6/4/2012  purple
1  6/5/2012  3/31/2017 blue
2  5/1/2014  3/31/2017 red
3  1/12/2012 2/12/2014 purple
I would like to get a result like this:
dat
ID Date      colour
1  3/31/2017 blue
2  2/11/2016 red
2  4/10/2016 red
3  5/15/2015 NA
The two input data frames can be created with the following code:
library(lubridate)
library(tibble)
df1 <- tibble(ID = c(1,2,2,3), Date = mdy(c("3/31/2017","2/11/2016","4/10/2016","5/15/2015")))
df2 <- tibble(ID = c(1,1,1,2,3), start = mdy(c("1/1/2000","4/1/2011","6/5/2012","5/1/2014","1/12/2012")), end = mdy(c("3/31/2011","6/4/2012","3/31/2017","3/31/2017","2/12/2014")), colour = c("blue", "purple", "blue", "red", "purple"))
I used a reply to a similar question, Checking if Date is Between two Dates in R, and ended up calling between(df1$Date, df2$start, df2$end).
I get the following error:
Error in between(df1$Date, df2$start, df2$end): Expecting a single value: [extent = 355368].
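Any help?
Thanks a lot!
Update:
Thank you all very much for your answers.
I tried all of them, but every resulting dataset has a different number of rows than the first dataset. I am not sure what is going on. The data I posted is similar to the data I am actually working with. Are there other details I should give you? I don't know where to start...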
Answer 0 (score: 2)
Your data frames seem to be large, so you could try a data.table non-equi join to do this efficiently:
library(lubridate)
library(data.table)
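# Convert both data frames to data.tables by reference
setDT(df1); setDT(df2)
# Parse the dates. This assumes the columns are still character strings such as
# "3/31/2017"; skip these two lines if they are already Date objects.
df1[, Date := mdy(Date)]
df2[, c("start", "end") := .(mdy(start), mdy(end))]
# Non-equi join: for each row of df1, find the df2 row with the same ID whose
# interval contains Date; rows of df1 without a match get colour NA.
df2[df1, .(ID = i.ID, Date = i.Date, colour), on=.(ID, start <= Date, end >= Date)]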
# ID Date colour
#1: 1 2017-03-31 blue
#2: 2 2016-02-11 red
#3: 2 2016-04-10 red
#4: 3 2015-05-15 NA
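Regarding the row counts mentioned in the update: one possible cause (an assumption, since the real data is not shown) is that some intervals in df2 overlap for the same ID, so a single date matches more than one row. The base R sketch below uses the toy df2 from the question and flags intervals that start on or before the end of the previous interval of the same ID. The helper names prev_id, prev_end and overlaps are made up for illustration.

```r
# Toy copy of df2 from the question, with dates parsed in base R.
df2 <- data.frame(
  ID     = c(1, 1, 1, 2, 3),
  start  = as.Date(c("1/1/2000", "4/1/2011", "6/5/2012", "5/1/2014", "1/12/2012"),
                   format = "%m/%d/%Y"),
  end    = as.Date(c("3/31/2011", "6/4/2012", "3/31/2017", "3/31/2017", "2/12/2014"),
                   format = "%m/%d/%Y"),
  colour = c("blue", "purple", "blue", "red", "purple")
)

# Sort so that each interval sits directly after the previous one of the same ID.
df2 <- df2[order(df2$ID, df2$start), ]

# Shift ID and end down by one row to compare each interval with its predecessor.
prev_id  <- c(NA, head(df2$ID, -1))
prev_end <- c(as.Date(NA), head(df2$end, -1))

# An interval overlaps its predecessor if it has the same ID and starts
# on or before the predecessor's end date.
overlaps <- !is.na(prev_id) & df2$ID == prev_id & df2$start <= prev_end

n_overlaps <- sum(overlaps)
df2[overlaps, ]  # rows that would let one date match several intervals
```

If any intervals of an ID overlap, at least one row is flagged, although not necessarily every overlapping row. On the toy data nothing is flagged, so here each date matches at most one interval.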
Answer 1 (score: 1)
I reproduced your example and wrote a solution for it.
library(tidyverse)
library(lubridate)
df1 <- data.frame(ID = c(1, 2, 2, 3),
                  actual.date = mdy(c('3/31/2017', '2/11/2016', '4/10/2016', '5/15/2015')))
df2 <- data.frame(ID = c(1, 1, 1, 2, 3),
                  start = mdy(c('1/1/2000', '4/1/2011', '6/5/2012', '5/1/2014', '1/12/2012')),
                  end = mdy(c('3/31/2011', '6/4/2012', '3/31/2017', '3/31/2017', '2/12/2014')),
                  colour = c("blue", "purple", "blue", "red", "purple"))
# Pair every date with every interval of the same ID, keep the pairs where the
# date lies inside the interval (both ends inclusive), then join back onto df1
# so that dates without a matching interval get colour NA.
df <- full_join(df1, df2, by = "ID") %>%
  filter(actual.date >= start & actual.date <= end) %>%
  left_join(df1, ., by = c("ID", "actual.date")) %>%
  select(ID, actual.date, colour)
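(The lubridate package is not required, but it is handy for entering the dates.)
Next time, please provide a reproducible example so that we don't have to retype the data by hand!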
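As a possible shortcut, and assuming dplyr 1.1.0 or later is installed, the join-then-filter steps can be collapsed into a single left join with join_by() and its between() helper. This is only a sketch on the toy data from the question, with the dates written in ISO format for brevity. It is not what the code above does internally.

```r
library(dplyr)

# Toy data from the question (tibble() is re-exported by dplyr).
df1 <- tibble(
  ID   = c(1, 2, 2, 3),
  Date = as.Date(c("2017-03-31", "2016-02-11", "2016-04-10", "2015-05-15"))
)
df2 <- tibble(
  ID     = c(1, 1, 1, 2, 3),
  start  = as.Date(c("2000-01-01", "2011-04-01", "2012-06-05", "2014-05-01", "2012-01-12")),
  end    = as.Date(c("2011-03-31", "2012-06-04", "2017-03-31", "2017-03-31", "2014-02-12")),
  colour = c("blue", "purple", "blue", "red", "purple")
)

# Left join on equal ID and on Date lying in [start, end] (inclusive);
# rows of df1 without a matching interval are kept with colour NA.
res <- df1 %>%
  left_join(df2, by = join_by(ID, between(Date, start, end))) %>%
  select(ID, Date, colour)

res
```

Because this is a left join, the result has one row per row of df1 as long as no date falls into more than one interval of the same ID.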
Answer 2 (score: 1)
Using sqldf:
library(sqldf)
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")
df2$start <- as.Date(df2$start, "%m/%d/%Y")
df2$end <- as.Date(df2$end, "%m/%d/%Y")
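# LEFT JOIN keeps the rows of df1 that have no matching interval (colour is NA),
# which is what the desired output for ID 3 requires.
sqldf({"
SELECT df1.*, df2.colour FROM df1
LEFT JOIN df2
ON df1.ID = df2.ID AND df1.Date <= df2.end AND df1.Date >= df2.start
"})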
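Answer 3 (score: 1)
dplyr uses non-standard evaluation, so you can drop all the data frame names and the $, and your code was basically heading in the right direction. You also need a few implicit conversions to end up with exactly the data frame you specified, but the code below should get you there.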
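library(dplyr)
dat <-
  df1 %>%
  inner_join(df2, by = "ID") %>%          # pair each date with every interval of the same ID
  rowwise() %>%
  mutate(match = ifelse(between(Date, start, end), 1, NA)) %>%  # 1 if Date is inside the interval
  arrange(ID, Date, desc(match)) %>%      # matching intervals first, NA sorts last
  ungroup() %>%
  group_by(ID, Date) %>%
  mutate(best = row_number(ID),
         colour = if_else(is.na(match), NA_character_, colour)) %>%
  filter(best == 1) %>%                   # keep one row per ID/Date pair
  select(ID, Date, colour)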
> dat
# A tibble: 4 x 3
# Groups:   ID, Date [4]
     ID Date       colour
  <dbl> <date>     <chr>
1     1 2017-03-31 blue
2     2 2016-02-11 red
3     2 2016-04-10 red
4     3 2015-05-15 <NA>
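For comparison, the same merge, filter, and merge-back idea can be sketched in base R without any packages. The version below runs on the toy data from the question and is only an illustration; the names m, hit, matched and res are made up. The final check relates to the row-count problem from the update: the result should have exactly as many rows as df1, provided no date falls into more than one interval of the same ID.

```r
# Toy data from the question, with dates parsed in base R.
df1 <- data.frame(
  ID   = c(1, 2, 2, 3),
  Date = as.Date(c("3/31/2017", "2/11/2016", "4/10/2016", "5/15/2015"),
                 format = "%m/%d/%Y")
)
df2 <- data.frame(
  ID     = c(1, 1, 1, 2, 3),
  start  = as.Date(c("1/1/2000", "4/1/2011", "6/5/2012", "5/1/2014", "1/12/2012"),
                   format = "%m/%d/%Y"),
  end    = as.Date(c("3/31/2011", "6/4/2012", "3/31/2017", "3/31/2017", "2/12/2014"),
                   format = "%m/%d/%Y"),
  colour = c("blue", "purple", "blue", "red", "purple")
)

# Step 1: pair every date with every interval of the same ID.
m <- merge(df1, df2, by = "ID")

# Step 2: keep only the pairs where the date lies inside the interval (inclusive).
hit     <- m$Date >= m$start & m$Date <= m$end
matched <- m[hit, c("ID", "Date", "colour")]

# Step 3: merge back onto df1 so that dates without a match get colour NA.
res <- merge(df1, matched, by = c("ID", "Date"), all.x = TRUE)

# The result should have one row per row of df1.
nrow(res) == nrow(df1)
res
```

If the row counts differ on the real data, duplicated ID/Date pairs in df1 or overlapping intervals in df2 are two things worth checking.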