Question

I have a dataset with two columns, ID and Start_Date, as shown below:

  ID        Start_Date
  19        2016-11-24
  19        2016-11-26
  3C        2016-01-16
  3C        2016-03-18
  14        2018-03-03
  14        2018-01-19

I also have a second dataset containing some random purchase data for each ID on different dates:

  ID      Transaction_Date     Item
  19      2015-10-24           Pop
  19      2015-12-11           Crackers
  19      2017-11-25           Honey  
  19      2018-03-14           PBJ
  19      2018-11-24           Roku_Stick
  19      2019-01-10           Pop
  19      2019-02-15           LipBalm  
  19      2019-03-25           Pop
  3C      2015-04-16           Honey
  3C      2016-02-20           PBJ
  3C      2016-08-04           Crackers
  3C      2019-05-12           Roku_Stick          
  14      2017-07-11           Pop   
  14      2018-09-26           Pop

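My intent is to:

1) Merge the two datasets by ID. This part is easy; I know we can use the merge function: df_result <- merge(df1, df2, by = "ID", all = TRUE)

2) For each ID, keep only the rows from the second dataset that fall within two years of the Start_Date in the first dataset.

To illustrate, consider the first observation in dataset 1: ID 19, with Start_Date 2016-11-24. The rows from the second dataset are then included or excluded as follows: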

  ID      Transaction_Date   Item         Status
  19      2015-10-24         Pop          Exclude, because earlier than start date 2016-11-24
  19      2015-12-11         Crackers     Exclude, because earlier than start date 2016-11-24
  19      2017-11-25         Honey        Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
  19      2018-03-14         PBJ          Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
  19      2018-11-24         Roku_Stick   Include, because transaction occurs after the start date 2016-11-24 and within 2 years of 2016-11-24
  19      2019-01-10         Pop          Exclude, because transaction is more than 2 years after start date 2016-11-24
  19      2019-02-15         LipBalm      Exclude, because transaction is more than 2 years after start date 2016-11-24
  19      2019-03-25         Pop          Exclude, because transaction is more than 2 years after start date 2016-11-24
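
The include/exclude rule in the table above can be sketched in base R for ID 19 alone. This is only an illustration: the data is retyped from the tables, the names `start`, `end`, and `tx` are made up for the example, and both ends of the window are assumed to be inclusive (which is what the table implies for 2018-11-24).

```r
# Illustrative data for ID 19 only, retyped from the tables above.
start <- as.Date("2016-11-24")
tx <- data.frame(
  Transaction_Date = as.Date(c("2015-10-24", "2015-12-11", "2017-11-25",
                               "2018-03-14", "2018-11-24", "2019-01-10",
                               "2019-02-15", "2019-03-25")),
  Item = c("Pop", "Crackers", "Honey", "PBJ",
           "Roku_Stick", "Pop", "LipBalm", "Pop"),
  stringsAsFactors = FALSE
)

# seq() with by = "2 years" steps by calendar years, not by 730 days.
end <- seq(start, by = "2 years", length.out = 2)[2]

# Assumption: the window [start, end] is inclusive on both sides.
in_window <- tx$Transaction_Date >= start & tx$Transaction_Date <= end
tx$Status <- ifelse(in_window, "Include", "Exclude")
tx
```

Whether the exact two-year anniversary counts as inside the window is a choice; switching `<=` to `<` would exclude the Roku_Stick purchase.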

Final expected dataset:

   ID      Start_Date   Pop   Crackers  Honey  PBJ  Roku_Stick  LipBalm
   19      2016-11-24   No    Yes       Yes    Yes  Yes         No

Similarly, for the other rows:

   ID      Start_Date   Pop   Crackers  Honey  PBJ  Roku_Stick  LipBalm
   19      2016-11-26   No    Yes       Yes    Yes  Yes         No
   3C      2016-01-16   No    Yes       No     Yes  No          No
   14      2018-03-03   Yes   No        No     No   No          No 
   14      2018-01-19   Yes   No        No     No   No          No

I know the long way of doing this:

merge,

if-else Start_Date + 2 <= Transaction_Date, Include, Exclude,

df <- df[ subset(Include),]

df <- long to wide.
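
For reference, the four steps listed above could look roughly like this in base R. This is a sketch on a small subset of the data (IDs 3C and 14, one start date each), not a tuned solution; `add_two_years` is a hypothetical helper written for this example, and the window is assumed to be inclusive at both ends.

```r
# Small illustrative subset of the question's data.
df1 <- data.frame(ID = c("3C", "14"),
                  Start_Date = as.Date(c("2016-01-16", "2018-01-19")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("3C", "3C", "3C", "3C", "14", "14"),
                  Transaction_Date = as.Date(c("2015-04-16", "2016-02-20",
                                               "2016-08-04", "2019-05-12",
                                               "2017-07-11", "2018-09-26")),
                  Item = c("Honey", "PBJ", "Crackers", "Roku_Stick", "Pop", "Pop"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: shift a Date vector forward by two calendar years.
# Note: 29 February rolls over to 1 March in a non-leap target year.
add_two_years <- function(d) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year + 2
  as.Date(lt)
}

# Step 1: merge on ID (each start date is paired with each transaction).
m <- merge(df1, df2, by = "ID")

# Steps 2 and 3: flag rows inside the window, then subset.
include <- m$Transaction_Date >= m$Start_Date &
  m$Transaction_Date <= add_two_years(m$Start_Date)
keep <- m[include, ]

# Step 4: long to wide, one row per ID/Start_Date and one column per item.
counts <- unclass(table(paste(keep$ID, keep$Start_Date), keep$Item))
wide <- as.data.frame(ifelse(counts > 0, "Yes", "No"), stringsAsFactors = FALSE)
wide
```

Items that never fall inside any window (Honey and Roku_Stick in this subset) get no column at all, so they would have to be added back by hand if a fixed set of columns is needed.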

I am interested in finding a much more efficient way to transform this dataset. Any help is much appreciated. Thanks in advance.

######## Reproducible dataset

df1 <- structure(list(ID = structure(c(2L, 2L, 3L, 3L, 1L, 1L), .Label = c("14", 
"19", "3C"), class = "factor"), Start_Date = structure(c(3L, 
4L, 1L, 2L, 6L, 5L), .Label = c("2016-01-16", "2016-03-18", "2016-11-24", 
"2016-11-26", "2018-01-19", "2018-03-03"), class = "factor")), .Names = c("ID", 
"Start_Date"), row.names = c(NA, -6L), class = "data.frame")

df2 <- structure(list(ID = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 1L, 1L), .Label = c("14", "19", "3C"), class = "factor"), 
    Transaction_Date = structure(c(2L, 3L, 7L, 8L, 10L, 11L, 
    12L, 13L, 1L, 4L, 5L, 14L, 6L, 9L), .Label = c("2015-04-16", 
    "2015-10-24", "2015-12-11", "2016-02-20", "2016-08-04", "2017-07-11", 
    "2017-11-25", "2018-03-14", "2018-09-26", "2018-11-24", "2019-01-10", 
    "2019-02-15", "2019-03-25", "2019-05-12"), class = "factor"), 
    Item = structure(c(5L, 1L, 2L, 4L, 6L, 5L, 3L, 5L, 2L, 4L, 
    1L, 6L, 5L, 5L), .Label = c("Crackers", "Honey", 
    "LipBalm", "PBJ", "Pop", "Roku_Stick"), class = "factor")), .Names = c("ID", 
"Transaction_Date", "Item"), row.names = c(NA, -14L), class = "data.frame")

Answer 1

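Here is a tidyverse solution. First we join, then convert the dates to Date objects. Next we filter, applying the two constraints with some lubridate tools (%m+% years(2)), select the columns we want to keep, and create an additional column where everything is TRUE so that we can spread it into one column per item. The fill argument of spread fills the missing values with FALSE instead of NA.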

library(lubridate)
library(dplyr)
library(tidyr)

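# Pair every transaction with every start date for the same ID
df2 %>% 
  dplyr::left_join(df1, by = "ID") %>% 
  dplyr::mutate(Transaction_Date = as.Date(Transaction_Date),
                Start_Date = as.Date(Start_Date)) %>% 
  # Keep transactions in [Start_Date, Start_Date + 2 years)
  dplyr::filter(Transaction_Date >= Start_Date,
                Transaction_Date < (Start_Date %m+% years(2))) %>% 
  dplyr::select(ID, Start_Date, Item) %>% 
  # Flag column that is spread into one column per item
  dplyr::mutate(ItemTrue = TRUE) %>% 
  tidyr::spread(Item, ItemTrue, fill = FALSE)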

  ID Start_Date Crackers Honey   PBJ   Pop Roku_Stick
1 14 2018-01-19    FALSE FALSE FALSE  TRUE      FALSE
2 14 2018-03-03    FALSE FALSE FALSE  TRUE      FALSE
3 19 2016-11-24    FALSE  TRUE  TRUE FALSE      FALSE
4 19 2016-11-26    FALSE  TRUE  TRUE FALSE       TRUE
5 3C 2016-01-16     TRUE FALSE  TRUE FALSE      FALSE
6 3C 2016-03-18     TRUE FALSE FALSE FALSE      FALSE
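
In newer versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping step, assuming a reasonably recent tidyr and using only a small retyped subset of the data (ID 19), might look like this. The added distinct() call is a precaution of this sketch, not part of the code above; it guards against an item bought twice within one window.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Illustrative subset: ID 19 only, retyped from the question.
df1 <- data.frame(ID = c("19", "19"),
                  Start_Date = as.Date(c("2016-11-24", "2016-11-26")),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = "19",
                  Transaction_Date = as.Date(c("2015-10-24", "2017-11-25",
                                               "2018-03-14", "2018-11-24",
                                               "2019-01-10")),
                  Item = c("Pop", "Honey", "PBJ", "Roku_Stick", "Pop"),
                  stringsAsFactors = FALSE)

# Recent dplyr versions may warn about a many-to-many join here;
# that pairing is intended.
res <- df2 %>%
  left_join(df1, by = "ID") %>%
  filter(Transaction_Date >= Start_Date,
         Transaction_Date < Start_Date %m+% years(2)) %>%
  distinct(ID, Start_Date, Item) %>%
  mutate(ItemTrue = TRUE) %>%
  pivot_wider(names_from = Item, values_from = ItemTrue,
              values_fill = list(ItemTrue = FALSE))
res
```

Because the upper bound uses `<`, the Roku_Stick purchase on 2018-11-24 is excluded for the 2016-11-24 start date but included for 2016-11-26.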

Answer 2

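The fuzzyjoin package is built for exactly this kind of need. If you want to break the code into separate steps, you can use fuzzy_left_join() to inspect the matches.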

library(tidyverse)
library(lubridate)  # ymd(), %m+%, years(); not attached by older tidyverse versions
library(fuzzyjoin)

df_dates <-
  df1 %>% 
  mutate(
    Start_Date = ymd(Start_Date),
    End_Date = Start_Date %m+% years(2),
    Status = "Yes"
  )

df_items <-
  df2 %>% 
  mutate(Transaction_Date = as.Date(Transaction_Date))

fuzzy_join(
  df_items, df_dates,
  by = c("ID" = "ID", 
         "Transaction_Date" = "Start_Date",
         "Transaction_Date" = "End_Date"),
  match_fun = list(`==`, `>=`, `<=`)
) %>%
select(ID = ID.x, Item, Start_Date, Status) %>%
distinct() %>%
spread(Item, Status, fill = "No")

#  ID Start_Date Crackers Honey PBJ Pop Roku_Stick
#1 14 2018-01-19       No    No  No Yes         No
#2 14 2018-03-03       No    No  No Yes         No
#3 19 2016-11-24       No   Yes Yes  No        Yes
#4 19 2016-11-26       No   Yes Yes  No        Yes
#5 3C 2016-01-16      Yes    No Yes  No         No
#6 3C 2016-03-18      Yes    No  No  No         No
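
To inspect the matches step by step, as mentioned at the start of this answer, a sketch with fuzzy_left_join() on a tiny retyped subset (ID 14, one start date) could look like the following. The data frames here are rebuilt by hand for illustration. A left join keeps every transaction, and those with no matching window get NA in the columns coming from the dates table.

```r
library(dplyr)
library(lubridate)
library(fuzzyjoin)

# Illustrative subset: ID 14 with a single start date.
df_dates <- data.frame(ID = "14",
                       Start_Date = as.Date("2018-01-19"),
                       stringsAsFactors = FALSE) %>%
  mutate(End_Date = Start_Date %m+% years(2))

df_items <- data.frame(ID = "14",
                       Transaction_Date = as.Date(c("2017-07-11", "2018-09-26")),
                       Item = "Pop",
                       stringsAsFactors = FALSE)

# Same join conditions as above, but keeping unmatched transactions.
matches <- fuzzy_left_join(
  df_items, df_dates,
  by = c("ID" = "ID",
         "Transaction_Date" = "Start_Date",
         "Transaction_Date" = "End_Date"),
  match_fun = list(`==`, `>=`, `<=`)
)
matches
```

Rows where Start_Date is NA are the transactions that fall outside every window for that ID.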

r - Merge based on time constraints and create a data frame