I have a dataset that looks like this:
x = data.frame(id = c("A","A","A","A","B","B","B","B"), group = c(1,1,2,2,3,3,4,4),
date1 = c("25/03/2017", "26/03/2017","03/04/2017","04/04/2017",
"04/05/2017","26/08/2017","28/08/2017","30/08/2017"),
date2 = c("26/03/2017","29/03/2017","04/04/2017","04/05/2017",
"18/05/2017","28/08/2017","29/08/2017","31/08/2017")
)
> x
id group date1 date2
1 A 1 25/03/2017 26/03/2017
2 A 1 26/03/2017 29/03/2017
3 A 2 03/04/2017 04/04/2017
4 A 2 04/04/2017 04/05/2017
5 B 3 04/05/2017 18/05/2017
6 B 3 26/08/2017 28/08/2017
7 B 4 28/08/2017 29/08/2017
8 B 4 30/08/2017 31/08/2017
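For each person, I want the difference in days between the first date in date1 of the second group and the last date in date2 of the previous group. For example, for the person with id = A, I want the number of days between "03/04/2017" and "29/03/2017". The same goes for patient B. Each person has multiple groups. I would like to end up with a dataset like this: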
y = data.frame(id = c("A","A","B","B"), group = c(1,2,3,4),
date1 = c("26/03/2017","03/04/2017","26/08/2017","28/08/2017"),
date2 = c("29/03/2017","04/04/2017","28/08/2017","29/08/2017"),
datediff = c(NA,5,NA,0)
)
> y
id group date1 date2 datediff
1 A 1 26/03/2017 29/03/2017 NA
2 A 2 03/04/2017 04/04/2017 5
3 B 3 26/08/2017 28/08/2017 NA
4 B 4 28/08/2017 29/08/2017 0
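I have looked around and found answers on subtracting the first and last observations within the same group, but nothing on subtracting the last observation of one group from the first observation of the next. Any help would be greatly appreciated. Thanks.
Answer 0 (score: 0)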
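Here is a more general approach that should also work with 3 or more rows per group. With 3 or more groups per id, note that it compares only the lowest- and highest-numbered group of each id: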
library(dplyr)
library(lubridate)
# parse the date columns (skip if they are already Date objects)
x = x %>% mutate_at(vars(matches("date")), dmy)
# last row (latest date1) of each id's first group
x1 = x %>%
  group_by(id) %>%
  filter(group == min(group)) %>%
  filter(date1 == max(date1)) %>%
  ungroup()
# first row (earliest date2) of each id's last group
x2 = x %>%
  group_by(id) %>%
  filter(group == max(group)) %>%
  filter(date2 == min(date2)) %>%
  ungroup()
# combine both sets of rows and compute the date difference in days
x1 %>%
  bind_rows(x2) %>%
  arrange(id, group) %>%
  group_by(id) %>%
  mutate(datediff = as.numeric(date1 - lag(date2))) %>%
  ungroup()
# # A tibble: 4 x 5
# id group date1 date2 datediff
# <fct> <dbl> <date> <date> <dbl>
# 1 A 1 2017-03-26 2017-03-29 NA
# 2 A 2 2017-04-03 2017-04-04 5
# 3 B 3 2017-08-26 2017-08-28 NA
# 4 B 4 2017-08-28 2017-08-29 0
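The code above keeps only the lowest and highest group of each id. If an id can have several consecutive groups, one possible variant is to summarise every (id, group) pair first and then lag within id. This is only a sketch under assumptions: the column names first_date1 and last_date2 are made up for illustration, "first" and "last" are taken to mean the earliest date1 and the latest date2 in a group, and `.groups = "drop"` needs dplyr 1.0.0 or later. Unlike the desired output, it returns one summary row per group rather than an original row.

```r
library(dplyr)
library(lubridate)

# Same data as in the question, with the dates parsed up front
x <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "B"),
  group = c(1, 1, 2, 2, 3, 3, 4, 4),
  date1 = dmy(c("25/03/2017", "26/03/2017", "03/04/2017", "04/04/2017",
                "04/05/2017", "26/08/2017", "28/08/2017", "30/08/2017")),
  date2 = dmy(c("26/03/2017", "29/03/2017", "04/04/2017", "04/05/2017",
                "18/05/2017", "28/08/2017", "29/08/2017", "31/08/2017"))
)

gaps <- x %>%
  group_by(id, group) %>%
  # one row per (id, group): earliest date1 and latest date2 (hypothetical names)
  summarise(first_date1 = min(date1),
            last_date2 = max(date2),
            .groups = "drop") %>%
  arrange(id, group) %>%
  group_by(id) %>%
  # days between the start of this group and the end of the previous group
  mutate(datediff = as.numeric(first_date1 - lag(last_date2))) %>%
  ungroup()

gaps
```

On the example data the datediff column should again be NA, 5, NA, 0, and the same pipeline extends to ids with three or more groups.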
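Answer 1 (score: 0)
Parse the string dates with lubridate::dmy. Then, with dplyr, keep only the rows that sit on a group boundary (the last row of one group and the first row of the next) and compute the difference between date1 and the lagged value of date2.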
library(dplyr)
library(lubridate)
x = data.frame(id = c("A","A","A","A","B","B","B","B"), group = c(1,1,2,2,3,3,4,4),
date1 = dmy(c("25/03/2017", "26/03/2017","03/04/2017","04/04/2017",
"04/05/2017","26/08/2017","28/08/2017","30/08/2017")),
date2 = dmy(c("26/03/2017","29/03/2017","04/04/2017","04/05/2017",
"18/05/2017","28/08/2017","29/08/2017","31/08/2017"))
)
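x %>%
  group_by(id) %>%
  # keep rows whose group differs from the previous or the next row within id
  filter(group != lag(group) | group != lead(group)) %>%
  # days between this row's date1 and the previous kept row's date2
  mutate(diff = date1 - lag(date2)) %>%
  ungroup()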
# A tibble: 4 x 5
id group date1 date2 diff
<fct> <dbl> <date> <date> <time>
1 A 1 2017-03-26 2017-03-29 NA days
2 A 2 2017-04-03 2017-04-04 " 5 days"
3 B 3 2017-08-26 2017-08-28 NA days
4 B 4 2017-08-28 2017-08-29 " 0 days"
If you want numeric output, use mutate(diff = as.numeric(date1 - lag(date2))). As long as your data is sorted (x <- x[with(x, order(id, group)), ]), it works regardless of how many individuals and groups there are.
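To make the difference between the day-valued result and its numeric form concrete, here is a small base R sketch using the two dates from the question's example for id = A. The variable names d1, d2, gap and gap_days are illustrative only.

```r
# Parse the day/month/year strings with base R
d1 <- as.Date("03/04/2017", format = "%d/%m/%Y")  # first date1 of group 2
d2 <- as.Date("29/03/2017", format = "%d/%m/%Y")  # last date2 of group 1

gap <- d1 - d2               # a difftime object measured in days
gap_days <- as.numeric(gap)  # plain number of days

print(gap)
print(gap_days)
```

Subtracting two Date objects yields a difftime, which is why the diff column prints with a "days" suffix until it is wrapped in as.numeric().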