Get the date difference between the first date in a group and the last date in the previous group

Asked: 2018-11-27 12:22:12

Tags: r dplyr data-manipulation lubridate

I have a dataset that looks like this:

x = data.frame(id = c("A","A","A","A","B","B","B","B"), group = c(1,1,2,2,3,3,4,4),
               date1 = c("25/03/2017",  "26/03/2017","03/04/2017","04/04/2017",
                         "04/05/2017","26/08/2017","28/08/2017","30/08/2017"),    
               date2 = c("26/03/2017","29/03/2017","04/04/2017","04/05/2017",
                         "18/05/2017","28/08/2017","29/08/2017","31/08/2017")
                )
> x
  id group      date1      date2
1  A     1 25/03/2017 26/03/2017
2  A     1 26/03/2017 29/03/2017
3  A     2 03/04/2017 04/04/2017
4  A     2 04/04/2017 04/05/2017
5  B     3 04/05/2017 18/05/2017
6  B     3 26/08/2017 28/08/2017
7  B     4 28/08/2017 29/08/2017
8  B     4 30/08/2017 31/08/2017

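What I want, for each person, is the difference in days between the first date in date1 of the second group and the last date in date2 of the previous group. For example, for the person with id = A, I want the number of days between "03/04/2017" and "29/03/2017". The same goes for patient B. Each person has multiple groups. I would like to end up with a dataset like this: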

y = data.frame(id = c("A","A","B","B"), group = c(1,2,3,4),
               date1 = c("26/03/2017","03/04/2017","26/08/2017","28/08/2017"),    
               date2 = c("29/03/2017","04/04/2017","28/08/2017","29/08/2017"),
               datediff = c(NA,5,NA,0)
              ) 
> y
  id group      date1      date2 datediff
1  A     1 26/03/2017 29/03/2017       NA
2  A     2 03/04/2017 04/04/2017        5
3  B     3 26/08/2017 28/08/2017       NA
4  B     4 28/08/2017 29/08/2017        0
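
As a quick check of the expected value for id A, the difference between the first date1 of group 2 and the last date2 of group 1 can be worked out by hand in base R. This is only a minimal sketch of the arithmetic; the variable names are illustrative and not part of my data:

```r
# Parse the two day/month/year strings for id A (names are illustrative only)
first_date1_group2 <- as.Date("03/04/2017", format = "%d/%m/%Y")
last_date2_group1  <- as.Date("29/03/2017", format = "%d/%m/%Y")

# Subtracting two Date objects gives a difftime in days;
# as.numeric() drops the units and keeps the number of days
gap_A <- as.numeric(first_date1_group2 - last_date2_group1)
gap_A
```

This is the value I expect to see in datediff for id A, group 2.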

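I looked around and found answers that subtract the first and last observations within the same group, but nothing for the last observation of one group and the first observation of the next. Any help would be much appreciated. Thanks.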

2 answers:

Answer 0: (score: 0)

Here is a more general approach that should work with 3 or more groups per id and/or 3 or more rows per group:

library(dplyr)
library(lubridate)

# update dates (if needed)
x = x %>% mutate_at(vars(matches("date")), dmy)

# get appropriate rows based on first group 
x1 = x %>%
  group_by(id) %>%
  filter(group == min(group)) %>%
  filter(date1 == max(date1)) %>%
  ungroup()

# get appropriate rows based on last group 
x2 = x %>%
  group_by(id) %>%
  filter(group == max(group)) %>%
  filter(date2 == min(date2)) %>%
  ungroup()

# combine datasets and calculate date difference
x1 %>%
  bind_rows(x2) %>%
  arrange(id, group) %>%
  group_by(id) %>%
  mutate(datediff = as.numeric(date1 - lag(date2))) %>%
  ungroup()

# # A tibble: 4 x 5
#   id    group date1      date2      datediff
#   <fct> <dbl> <date>     <date>        <dbl>
# 1 A         1 2017-03-26 2017-03-29       NA
# 2 A         2 2017-04-03 2017-04-04        5
# 3 B         3 2017-08-26 2017-08-28       NA
# 4 B         4 2017-08-28 2017-08-29        0
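
Note that the code above keeps only the lowest and the highest group for each id. If you need the gap for every pair of consecutive groups, one possible variation is to collapse each group to its first date1 and last date2 and then lag within id. This is a minimal sketch under stated assumptions, not a drop-in replacement: it assumes rows within a group can be ordered by date1, it needs dplyr 1.0.0 or later for the .groups argument, and the column names first_date1 and last_date2 are made up for illustration. The output also has a different shape from the one requested (one summary row per group).

```r
library(dplyr)
library(lubridate)

# Same toy data as in the question, with dates parsed up front
x <- data.frame(
  id    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  date1 = dmy(c("25/03/2017", "26/03/2017", "03/04/2017", "04/04/2017",
                "04/05/2017", "26/08/2017", "28/08/2017", "30/08/2017")),
  date2 = dmy(c("26/03/2017", "29/03/2017", "04/04/2017", "04/05/2017",
                "18/05/2017", "28/08/2017", "29/08/2017", "31/08/2017"))
)

gaps <- x %>%
  arrange(id, group, date1) %>%             # assumed ordering within each group
  group_by(id, group) %>%
  summarise(first_date1 = first(date1),     # first date1 of the group
            last_date2  = last(date2),      # last date2 of the group
            .groups = "drop") %>%
  group_by(id) %>%
  # gap between this group's first date1 and the previous group's last date2
  mutate(datediff = as.numeric(first_date1 - lag(last_date2))) %>%
  ungroup()

gaps
```

On the question's data this gives the same datediff values (NA, 5, NA, 0), since each id has only two groups.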

Answer 1: (score: 0)

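Parse the string dates with lubridate::dmy. Then use dplyr to keep only the rows that sit on a group boundary (the last row of one group and the first row of the next), and compute the difference between date1 and the lagged value of date2.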

library(dplyr)
library(lubridate)
x = data.frame(id = c("A","A","A","A","B","B","B","B"), group = c(1,1,2,2,3,3,4,4),
               date1 = dmy(c("25/03/2017",  "26/03/2017","03/04/2017","04/04/2017",
                         "04/05/2017","26/08/2017","28/08/2017","30/08/2017")),    
               date2 = dmy(c("26/03/2017","29/03/2017","04/04/2017","04/05/2017",
                         "18/05/2017","28/08/2017","29/08/2017","31/08/2017"))
)



x %>%
  group_by(id) %>%
  filter(group != lag(group) | group != lead(group)) %>%
  mutate(diff = date1 - lag(date2)) %>%
  ungroup()



# A tibble: 4 x 5
  id    group date1      date2      diff     
  <fct> <dbl> <date>     <date>     <time>   
1 A         1 2017-03-26 2017-03-29 NA days  
2 A         2 2017-04-03 2017-04-04 " 5 days"
3 B         3 2017-08-26 2017-08-28 NA days  
4 B         4 2017-08-28 2017-08-29 " 0 days"

If you want numeric output, use mutate(diff = as.numeric(date1 - lag(date2))). As long as your data is sorted (x <- x[with(x, order(id, group)), ]), it works regardless of how many individuals and groups you have.
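
Putting those two remarks together, here is a minimal sketch that sorts a shuffled copy of the data first and returns a numeric column. The shuffled row order is made up for illustration, and adding date1 to the sort key is my own assumption so that rows inside each group are in date order too:

```r
library(dplyr)
library(lubridate)

x <- data.frame(
  id    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  date1 = dmy(c("25/03/2017", "26/03/2017", "03/04/2017", "04/04/2017",
                "04/05/2017", "26/08/2017", "28/08/2017", "30/08/2017")),
  date2 = dmy(c("26/03/2017", "29/03/2017", "04/04/2017", "04/05/2017",
                "18/05/2017", "28/08/2017", "29/08/2017", "31/08/2017"))
)

# Hypothetical unsorted input: the same rows in an arbitrary order
x_shuffled <- x[c(8, 3, 1, 6, 2, 5, 7, 4), ]

# Sort by id and group; date1 is added (an assumption) to order rows within a group
x_sorted <- x_shuffled[with(x_shuffled, order(id, group, date1)), ]

result <- x_sorted %>%
  group_by(id) %>%
  # keep rows where the group changes relative to the previous or next row
  filter(group != lag(group) | group != lead(group)) %>%
  # numeric number of days instead of a difftime column
  mutate(diff = as.numeric(date1 - lag(date2))) %>%
  ungroup()

result
```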