Select the last non-contemporaneous date in a group

Time: 2020-08-27 18:56:32

Tags: r dplyr mutate

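People are buying things, and for each person I have the date they last bought the item, along with their zip code (ZCTA5). I want to get the last non-contemporaneous date within each group, that is, the most recent earlier date in the same zip code.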

library(dplyr)
ZCTA5 = c("b", "c", "a", "b", "b", "c", "a", "a", "a", "c")
App.Complete.Date = c("2005-01-23", "2005-01-23",
                      "2006-07-13", "2006-11-21",
                      "2006-11-21", "2006-11-21",
                      "2007-01-01", "2007-01-01",
                      "2007-01-01", "2007-01-01")
xxx <- data.frame(ZCTA5,App.Complete.Date) %>% 
     arrange(ZCTA5,App.Complete.Date); xxx
 Last.Unique.Date.In.ZCTA5 =c(NA, "2006-07-13", "2006-07-13", "2006-07-13", NA, "2005-01-23", 
                         "2005-01-23", NA, "2005-01-23", "2006-11-21") 

Desired output

   ZCTA5 App.Complete.Date Last.Unique.Date.In.ZCTA5
1      a        2006-07-13                      <NA>
2      a        2007-01-01                2006-07-13
3      a        2007-01-01                2006-07-13
4      a        2007-01-01                2006-07-13
5      b        2005-01-23                      <NA>
6      b        2006-11-21                2005-01-23
7      b        2006-11-21                2005-01-23
8      c        2005-01-23                      <NA>
9      c        2006-11-21                2005-01-23
10     c        2007-01-01                2006-11-21

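I don't want to drop any observations. Doing this in place with a mutate would be ideal, but I understand that joining afterwards by ZCTA5 and a person ID (not shown here, but I do have one) would also work.

I can't figure out a way to mutate a new variable by lagging over the unique App.Complete.Date values, so I'm stuck. Slicing is also too awkward, because I still need the last date without removing the contemporaneous dates.

Edit: it is acceptable if, instead of NA, the value is the same row's App.Complete.Date.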

1 Answer:

Answer 0: (score: 1)

Try the following:

xxx = xxx %>% 
  mutate(App.Complete.Date = as.Date(App.Complete.Date),
         rn = row_number())

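This is the initial setup: it makes sure the date column is of Date type, and adds a row number so that the original duplicate dates are preserved.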
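
To see why the row number matters, here is a minimal sketch on a small made-up data frame (the `toy` data and its `Date` column are assumptions for illustration, not the question's data). Grouping on the date alone collapses duplicate rows, while adding `rn` to the grouping keeps one output row per original row.

```r
library(dplyr)

# Hypothetical toy data: zip code "a" has two rows with the same date
toy <- data.frame(ZCTA5 = c("a", "a", "a"),
                  Date  = as.Date(c("2006-07-13", "2007-01-01", "2007-01-01"))) %>%
  mutate(rn = row_number())

# Grouping without rn: the duplicated date collapses into a single group
without_rn <- toy %>%
  group_by(ZCTA5, Date) %>%
  summarise(n = n(), .groups = "drop")

# Grouping with rn: every original row stays in its own group
with_rn <- toy %>%
  group_by(ZCTA5, Date, rn) %>%
  summarise(n = n(), .groups = "drop")

nrow(without_rn)  # 2
nrow(with_rn)     # 3
```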

yyy = xxx %>%
  left_join(xxx, by = "ZCTA5") %>%
  # discard all the out-of-scope dates
  mutate(App.Complete.Date.y = ifelse(App.Complete.Date.y < App.Complete.Date.x,
                                      App.Complete.Date.y, NA)) %>%
  # we need to include row number here to preserve all rows in the original
  group_by(ZCTA5, App.Complete.Date.x, rn.x) %>%
  # na.rm = TRUE skips the values set to NA in the previous mutate;
  # a group with no earlier date has nothing left, so max() warns and returns -Inf
  summarise(App.Complete.Date.y = max(App.Complete.Date.y, na.rm = TRUE), .groups = 'drop') %>%
  # ifelse() dropped the Date class, leaving day counts: turn -Inf into NA and convert back
  mutate(App.Complete.Date.y = as.Date(ifelse(is.finite(App.Complete.Date.y),
                                              App.Complete.Date.y, NA),
                                       origin = "1970-01-01")) %>%
  # rename to output
  select(ZCTA5,
         App.Complete.Date = App.Complete.Date.x,
         Last.Unique.Date.In.ZCTA5 = App.Complete.Date.y)

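The origin argument in the last mutate tells as.Date() which base date the day counts are measured from. R stores dates as the number of days since "1970-01-01", so that is the value to use. I confirmed this when my machine returned 13342 instead of "2006-07-13": "2006-07-13" is 13342 days after "1970-01-01".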
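
The arithmetic above can be checked directly in base R. This is a small sketch rather than part of the original answer, and the variable names `d`, `stripped` and `restored` are made up for illustration. It shows that a Date is stored as a count of days since 1970-01-01, and that `ifelse()` returns that bare number, which is why the conversion back with `as.Date()` is needed.

```r
d <- as.Date("2006-07-13")

# A Date is stored internally as the number of days since 1970-01-01
as.numeric(d)                                  # 13342

# ifelse() drops attributes such as the Date class, leaving the bare number
stripped <- ifelse(TRUE, d, NA)
class(stripped)                                # "numeric"

# Converting back with the origin recovers the date
restored <- as.Date(stripped, origin = "1970-01-01")
format(restored)                               # "2006-07-13"
```

If the numeric round trip is unwanted, dplyr's `if_else()` keeps the Date class when both branches are Dates (for example with `as.Date(NA)` as the false value), so the conversion step would likely not be needed.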