I have a dataset with two columns, ID and Start_Date, as shown below.
ID Start_Date
19 2016-11-24
19 2016-11-26
3C 2016-01-16
3C 2016-03-18
14 2018-03-03
14 2018-01-19
A second dataset contains some random purchase data for each ID on different dates.
ID Transaction_Date Item
19 2015-10-24 Pop
19 2015-12-11 Crackers
19 2017-11-25 Honey
19 2018-03-14 PBJ
19 2018-11-24 Roku_Stick
19 2019-01-10 Pop
19 2019-02-15 LipBalm
19 2019-03-25 Pop
3C 2015-04-16 Honey
3C 2016-02-20 PBJ
3C 2016-08-04 Crackers
3C 2019-05-12 Roku_Stick
14 2017-07-11 Pop
14 2018-09-26 Pop
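My intent is to:
1) Merge the two datasets by ID. This part is easy; I know we can use the merge function: df_result <- merge(df1, df2, by = "ID", all = TRUE)
2) For each ID, keep only the rows of the second dataset that fall within two years of the Start_Date in the first dataset.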
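To illustrate, consider the first observation in dataset 1: ID 19, with Start_Date 2016-11-24. The rows of the second dataset for that ID are then included or excluded as follows: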
ID Transaction_Date Item Status
19 2015-10-24 Pop Exclude, because earlier than start date 2016-11-24
19 2015-12-11 Crackers Exclude, because earlier than start date 2016-11-24
19 2017-11-25 Honey Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
19 2018-03-14 PBJ Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
19 2018-11-24 Roku_Stick Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
19 2019-01-10 Pop Exclude, because transaction is after 2 years of start date 2016-11-24
19 2019-02-15 LipBalm Exclude, because transaction is after 2 years of start date 2016-11-24
19 2019-03-25 Pop Exclude, because transaction is after 2 years of start date 2016-11-24
Final expected dataset
ID Start_Date Pop Crackers Honey PBJ Roku_Stick LipBalm
19 2016-11-24 No Yes Yes Yes Yes No
Similarly, for the remaining rows:
ID Start_Date Pop Crackers Honey PBJ Roku_Stick LipBalm
19 2016-11-26 No Yes Yes Yes Yes No
3C 2016-01-16 No Yes No Yes No No
14 2018-03-03 Yes No No No No No
14 2018-01-19 Yes No No No No No
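I know a long-winded way to do this:
merge the two datasets,
flag each row with an if-else along the lines of "Transaction_Date is between Start_Date and Start_Date + 2 years: Include, otherwise Exclude",
keep only the Include rows (df <- df[ subset(Include),]),
and finally reshape df from long to wide.
I am interested in exploring a more efficient way to transform this dataset. Any help is much appreciated. Thanks in advance.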
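For reference, the step-by-step route described above can be sketched in base R alone. This is only an illustration under a few assumptions: the two tables are retyped by hand with the dates already parsed, `add_years()` is a hypothetical helper that is not part of the question, and the upper bound is treated as inclusive so that a purchase exactly two years after the start date is kept.

```r
# Hand-typed copies of the two tables from the question (dates already parsed).
df1 <- data.frame(
  ID = c("19", "19", "3C", "3C", "14", "14"),
  Start_Date = as.Date(c("2016-11-24", "2016-11-26", "2016-01-16",
                         "2016-03-18", "2018-03-03", "2018-01-19")),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID = c(rep("19", 8), rep("3C", 4), rep("14", 2)),
  Transaction_Date = as.Date(c(
    "2015-10-24", "2015-12-11", "2017-11-25", "2018-03-14", "2018-11-24",
    "2019-01-10", "2019-02-15", "2019-03-25", "2015-04-16", "2016-02-20",
    "2016-08-04", "2019-05-12", "2017-07-11", "2018-09-26")),
  Item = c("Pop", "Crackers", "Honey", "PBJ", "Roku_Stick", "Pop", "LipBalm",
           "Pop", "Honey", "PBJ", "Crackers", "Roku_Stick", "Pop", "Pop"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift dates by n calendar years.
# Caveat: a Feb 29 start may normalise to Mar 1 in a non-leap target year.
add_years <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year + n
  as.Date(lt)
}

# 1) merge: every Start_Date of an ID is paired with every transaction of that ID
merged <- merge(df1, df2, by = "ID")

# 2) flag and subset: keep transactions inside [Start_Date, Start_Date + 2 years]
in_window <- merged$Transaction_Date >= merged$Start_Date &
  merged$Transaction_Date <= add_years(merged$Start_Date, 2)
kept <- merged[in_window, ]

# 3) long to wide: one row per ID/Start_Date pair, one logical column per item
counts <- table(key = paste(kept$ID, kept$Start_Date), Item = kept$Item)
wide <- unclass(counts) > 0
wide
```

Items that never fall inside any window (LipBalm here) do not get a column, because `table()` only sees the rows that survived the filter.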
######## Reproducible dataset
df1 <- structure(list(ID = structure(c(2L, 2L, 3L, 3L, 1L, 1L), .Label = c("14",
"19", "3C"), class = "factor"), Start_Date = structure(c(3L,
4L, 1L, 2L, 6L, 5L), .Label = c("2016-01-16", "2016-03-18", "2016-11-24",
"2016-11-26", "2018-01-19", "2018-03-03"), class = "factor")), .Names = c("ID",
"Start_Date"), row.names = c(NA, -6L), class = "data.frame")
df2 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 1L, 1L), .Label = c("14", "19", "3C"), class = "factor"),
Transaction_Date = structure(c(2L, 3L, 7L, 8L, 10L, 11L,
12L, 13L, 1L, 4L, 5L, 14L, 6L, 9L), .Label = c("2015-04-16",
"2015-10-24", "2015-12-11", "2016-02-20", "2016-08-04", "2017-07-11",
"2017-11-25", "2018-03-14", "2018-09-26", "2018-11-24", "2019-01-10",
"2019-02-15", "2019-03-25", "2019-05-12"), class = "factor"),
Item = structure(c(5L, 1L, 2L, 4L, 6L, 5L, 3L, 5L, 2L, 4L,
1L, 6L, 5L, 5L), .Label = c("Crackers", "Honey",
"LipBalm", "PBJ", "Pop", "Roku_Stick"), class = "factor")), .Names = c("ID",
"Transaction_Date", "Item"), row.names = c(NA, -14L), class = "data.frame")
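Note that the dput() output above stores every column as a factor. A minimal sketch of converting such columns before joining or comparing dates, using a small stand-in data frame (`demo` is a made-up name, not part of the question):

```r
# Stand-in for a data frame whose columns arrived as factors.
demo <- data.frame(
  ID = factor(c("19", "3C")),
  Start_Date = factor(c("2016-11-24", "2016-01-16"))
)

# Go through character first, then parse the dates.
demo$ID <- as.character(demo$ID)
demo$Start_Date <- as.Date(as.character(demo$Start_Date))
str(demo)

# Why the detour matters: as.integer() on a factor returns the internal
# level codes (levels are sorted, so "14" is 1 and "19" is 2), not the labels.
as.integer(factor(c("19", "14")))  # 2 1
```

Converting ID to character in both tables also sidesteps problems when the two factors do not share the same set of levels.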
Answer 0 (score: 1)
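Here is a tidyverse solution. First we join the two tables, then convert the date columns to Date objects. Next we filter, using a tool from lubridate (%m+% years(2)), to apply the two date constraints, select the columns we want to keep, and create an extra column in which every value is TRUE, so that we can spread it into one column per item. Passing FALSE as the fill argument of spread fills the missing combinations with FALSE instead of NA.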
library(lubridate)
library(dplyr)
library(tidyr)
df2 %>%
  # pair every transaction with each Start_Date of the same ID
  dplyr::left_join(df1, by = "ID") %>%
  dplyr::mutate(Transaction_Date = as.Date(Transaction_Date),
                Start_Date = as.Date(Start_Date)) %>%
  # keep transactions from Start_Date up to and including Start_Date + 2 years
  dplyr::filter(Transaction_Date >= Start_Date,
                Transaction_Date <= (Start_Date %m+% years(2))) %>%
  dplyr::select(ID, Start_Date, Item) %>%
  dplyr::mutate(ItemTrue = TRUE) %>%
  # one column per item; combinations that never occur become FALSE
  tidyr::spread(Item, ItemTrue, fill = FALSE)
ID Start_Date Crackers Honey PBJ Pop Roku_Stick
1 14 2018-01-19 FALSE FALSE FALSE TRUE FALSE
2 14 2018-03-03 FALSE FALSE FALSE TRUE FALSE
3 19 2016-11-24 FALSE TRUE TRUE FALSE TRUE
4 19 2016-11-26 FALSE TRUE TRUE FALSE TRUE
5 3C 2016-01-16 TRUE FALSE TRUE FALSE FALSE
6 3C 2016-03-18 TRUE FALSE FALSE FALSE FALSE
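Two details of the date arithmetic are worth checking in isolation: how %m+% differs from a plain + when the start date is a leap day, and how the choice between a strict and an inclusive upper bound decides the fate of a purchase made exactly on the two-year anniversary. The dates below are picked for illustration; only the last pair comes from the question.

```r
library(lubridate)

leap_day <- as.Date("2016-02-29")
leap_day + years(2)      # NA, because 2018-02-29 does not exist
leap_day %m+% years(2)   # "2018-02-28", rolled back to the last day of the month

# Boundary case from the question: ID 19, Start_Date 2016-11-24,
# Roku_Stick bought on 2018-11-24.
start <- as.Date("2016-11-24")
txn <- as.Date("2018-11-24")
txn <  (start %m+% years(2))   # FALSE: a strict bound drops the anniversary
txn <= (start %m+% years(2))   # TRUE: an inclusive bound keeps it
```

The question marks that Roku_Stick purchase as "Include", which corresponds to the inclusive bound.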
Answer 1 (score: 0)
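The fuzzyjoin package is built for exactly this kind of need. If you want to break the code into separate steps, you can use fuzzy_left_join() to inspect the matches.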
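library(tidyverse)
library(lubridate)  # ymd(), years(), %m+%; attached explicitly in case tidyverse does not load it
library(fuzzyjoin)
# one row per ID/Start_Date pair, with the end of its two-year window
df_dates <-
  df1 %>%
  mutate(
    ID = as.character(ID),
    Start_Date = ymd(as.character(Start_Date)),
    End_Date = Start_Date %m+% years(2),
    Status = "Yes"
  )
# transactions, with ID as character so `==` compares plain strings
df_items <-
  df2 %>%
  mutate(ID = as.character(ID),
         Transaction_Date = as.Date(Transaction_Date))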
fuzzy_join(
df_items, df_dates,
by = c("ID" = "ID",
"Transaction_Date" = "Start_Date",
"Transaction_Date" = "End_Date"),
match_fun = list(`==`, `>=`, `<=`)
) %>%
select(ID = ID.x, Item, Start_Date, Status) %>%
distinct() %>%
spread(Item, Status, fill = "No")
# ID Start_Date Crackers Honey PBJ Pop Roku_Stick
#1 14 2018-01-19 No No No Yes No
#2 14 2018-03-03 No No No Yes No
#3 19 2016-11-24 No Yes Yes No Yes
#4 19 2016-11-26 No Yes Yes No Yes
#5 3C 2016-01-16 Yes No Yes No No
#6 3C 2016-03-18 Yes No No No No
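The fuzzy_left_join() variant mentioned at the start of this answer can be sketched as follows. It keeps every transaction and leaves NA in the date columns where no window matched, which makes it easy to see which rows would be dropped. The small `dates` and `items` tables are cut-down, hand-typed stand-ins for ID 19 only, not the full data from the question.

```r
library(dplyr)
library(lubridate)
library(fuzzyjoin)

# Two start dates for ID 19, each with the end of its two-year window.
dates <- data.frame(
  ID = c("19", "19"),
  Start_Date = as.Date(c("2016-11-24", "2016-11-26")),
  stringsAsFactors = FALSE
) %>%
  mutate(End_Date = Start_Date %m+% years(2))

# One transaction before, one inside, and one after the windows.
items <- data.frame(
  ID = c("19", "19", "19"),
  Transaction_Date = as.Date(c("2015-10-24", "2017-11-25", "2019-01-10")),
  Item = c("Pop", "Honey", "Pop"),
  stringsAsFactors = FALSE
)

# Left join: unmatched transactions stay in the result with NA window columns.
matches <- fuzzy_left_join(
  items, dates,
  by = c("ID" = "ID",
         "Transaction_Date" = "Start_Date",
         "Transaction_Date" = "End_Date"),
  match_fun = list(`==`, `>=`, `<=`)
)
matches
```

Here the Honey purchase matches both start dates and so appears twice, while the two Pop purchases appear once each with NA in Start_Date and End_Date.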