我有一个大型数据集。
我想创建一个列,以显示每个重复ID的开始日期和结束日期(从上一行开始)之间的天数。 例如,对于R1,由于它没有重复,我不会计算间隔。 对于R2,首先,我需要根据开始日期对其进行排序。然后我计算第二个最早开始日期和上一行结束日期之间的天数。接下来,我将继续计算从第二个最早开始日期起第三个最早开始日期和结束日期之间的天数,依此类推。我也想为任何其他重复的ID做这件事。
然后我想创建一个新列,以与第一部分相同的方式计算天数,用于具有相同事件级别的重复ID。 我想知道如何做到这一点。
ID<-c("R1","R2","R2","R3","R3","R4","R4","R4","R4","R3","R3","R3","R3","R2","R2","R2","R5","R6")
START<-c("3-4-2013","4-5-2018","4-5-2015","4-6-2011","5-5-2012","1-9-2010","23-4-1999","25-6-2011","3-6-2011","4-5-2014",
"6-6-2016","5-7-2014","7-7-1990","3-3-1998","4-4-1990","7-8-2014","22-4-1970","23-5-1984")
End<-c("3-4-2014","4-5-2019","5-5-2015","4-6-2013","5-5-2014","1-9-2012","23-4-2010","25-6-2015","3-6-2013","6-5-2014",
"6-8-2016","5-8-2014","7-9-1990","3-7-1998","4-9-1990","7-12-2014","22-7-1970","23-8-1984")
event<-c("a","b","b","s","s","f","f","b","b","a","a","a","s","c","c","b","m","a")
df<-data.frame(ID,START,End,event)
所以结果会是这样的:
ID START End event Time1 Time2
1 R1 3-4-2013 3-4-2014 a NA NA
14 R2 3-3-1998 3-7-1998 c NA NA
15 R2 4-4-1990 4-9-1990 c (4-4-1990)-(3-7-1998) (4-4-1990)-(3-7-1998)
3 R2 4-5-2015 5-5-2015 b (4-5-2015)-(4-9-1990) NA
2 R2 4-5-2018 4-5-2019 b (4-5-2018)-(5-5-2015) (4-5-2018)-(5-5-2015)
16 R2 7-8-2014 7-12-2014 b (7-8-2014)-(4-5-2019) (7-8-2014)-(4-5-2019)
10 R3 4-5-2014 6-5-2014 a NA NA
4 R3 4-6-2011 4-6-2013 s (4-6-2011)-(6-5-2014) NA
5 R3 5-5-2012 5-5-2014 s (5-5-2012)-(4-6-2013) (5-5-2012)-(4-6-2013)
12 R3 5-7-2014 5-8-2014 a (5-7-2014)-(5-5-2014) (5-7-2014)-(6-5-2014)
11 R3 6-6-2016 6-8-2016 a (6-6-2016)-(5-8-2014) (6-6-2016)-(5-8-2014)
13 R3 7-7-1990 7-9-1990 s (7-7-1990)-(5-5-2014)
6 R4 1-9-2010 1-9-2012 f
7 R4 23-4-1999 23-4-2010 f
8 R4 25-6-2011 25-6-2015 b
9 R4 3-6-2011 3-6-2013 b
17 R5 22-4-1970 22-7-1970 m
18 R6 23-5-1984 23-8-1984 a
>
答案 0 :(得分:1)
实现此目的的一种方法是使用dplyr
包,如下所示(修复数据框后如下所示):
library(dplyr)
df<-data.frame(ID,START,End,event, stringsAsFactors = FALSE)
df$START <- as.Date(df$START, format = '%d-%m-%Y')
df$End <- as.Date(df$End, format = '%d-%m-%Y')
df %>% arrange(ID, START, End) %>% group_by(ID) %>% mutate(laggedTimeElapsed = difftime(START, lag(End), units = 'days'))
不确定上面#2中你想要的是什么,但是,如果你想在给定的行中创建'事件持续时间',你只需执行以下操作:
df %>% arrange(ID, START, End) %>% group_by(ID) %>% mutate(laggedTimeElapsed = difftime(START, lag(End), units = 'days'), eventDuration = difftime(End, START, units = 'days'))
此处输出:
Source: local data frame [18 x 6]
Groups: ID [6]
ID START End event laggedTimeElapsed eventDuration
(chr) (date) (date) (chr) (dfft) (dfft)
1 R1 2013-04-03 2014-04-03 a NA days 365 days
2 R2 1990-04-04 1990-09-04 c NA days 153 days
3 R2 1998-03-03 1998-07-03 c 2737 days 122 days
4 R2 2014-08-07 2014-12-07 b 5879 days 122 days
5 R2 2015-05-04 2015-05-05 b 148 days 1 days
6 R2 2018-05-04 2019-05-04 b 1095 days 365 days
7 R3 1990-07-07 1990-09-07 s NA days 62 days
8 R3 2011-06-04 2013-06-04 s 7575 days 731 days
9 R3 2012-05-05 2014-05-05 s -395 days 730 days
10 R3 2014-05-04 2014-05-06 a -1 days 2 days
11 R3 2014-07-05 2014-08-05 a 60 days 31 days
12 R3 2016-06-06 2016-08-06 a 671 days 61 days
13 R4 1999-04-23 2010-04-23 f NA days 4018 days
14 R4 2010-09-01 2012-09-01 f 131 days 731 days
15 R4 2011-06-03 2013-06-03 b -456 days 731 days
16 R4 2011-06-25 2015-06-25 b -709 days 1461 days
17 R5 1970-04-22 1970-07-22 m NA days 91 days
18 R6 1984-05-23 1984-08-23 a NA days 92 days