I have the following data:
ID GROUP DATE
A GR1 12/01/2013
A GR1 09/04/2014
A GR1 01/03/2015
A GR2 04/04/2015
A GR2 08/21/2015
A GR1 01/05/2016
A GR1 06/28/2016
B GR2 11/01/2013
B GR2 06/04/2014
B GR2 04/15/2015
B GR3 11/04/2015
B GR2 03/21/2016
B GR2 07/05/2016
B GR1 06/28/2016
C GR2 01/16/2014
C GR2 06/04/2014
C GR2 04/15/2015
C GR3 11/04/2015
C GR2 03/21/2016
C GR2 06/05/2016
C GR1 06/28/2016
I want the date difference within each consecutive group for each ID. So the new table would look like this:
ID GROUP DATE Diff
A GR1 12/01/2013
A GR1 09/04/2014
A GR1 01/03/2015 398
A GR2 04/04/2015
A GR2 08/21/2015 139
A GR1 01/05/2016
A GR1 06/28/2016 175
B GR2 11/01/2013
B GR2 06/04/2014
B GR2 04/15/2015 530
B GR3 11/04/2015
B GR2 03/21/2016
B GR2 07/05/2016 106
B GR1 06/28/2016
C GR2 01/16/2014
C GR2 06/04/2014
C GR2 04/15/2015 454
C GR3 11/04/2015
C GR2 03/21/2016
C GR2 06/05/2016 76
C GR1 06/28/2016
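The value 398 in the "Diff" column is the difference '01/03/2015' - '12/01/2013'. All the other differences are computed the same way.
Now my question is: how do I get this difference? I can't take max(date) - min(date) within each group, because the same group repeats over different time periods. Likewise, I can't simply use first. and last. as in SAS.
I would be very grateful if someone could help me with this. I would prefer a solution in SAS, because the data is very large and will not fit in memory.
Regards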
Answer 0: (score: 6)
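library(dplyr)
library(data.table)
# Run id: increases by one whenever ID or GROUP changes from the previous row
df$xxx = rleidv(df, cols = c("ID", "GROUP"))
df$DATE = as.Date(df$DATE, format = "%m/%d/%Y")
# Date span within each run, then drop the helper column
df %>% group_by(xxx) %>% mutate(diff = max(DATE) - min(DATE)) %>%
  ungroup() %>% mutate(xxx = NULL)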
# ID GROUP DATE diff
# <chr> <chr> <date> <time>
#1 A GR1 2013-12-01 398 days
#2 A GR1 2014-09-04 398 days
#3 A GR1 2015-01-03 398 days
#4 A GR2 2015-04-04 139 days
#5 A GR2 2015-08-21 139 days
#6 A GR1 2016-01-05 175 days
#7 A GR1 2016-06-28 175 days
#8 B GR2 2013-11-01 530 days
#9 B GR2 2014-06-04 530 days
#10 B GR2 2015-04-15 530 days
Using data.table only:
library(data.table)
setDT(df)  # convert to a data.table in place; xxx and DATE are assumed to exist as created above
df[, diff := max(DATE)-min(DATE),by = c("xxx")][,xxx:=NULL]
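Note that the code above writes the span on every row of a run, whereas the question asks for it only on the last row of each run. A minimal base R sketch of that variant follows, assuming the rows are already in the desired order. The sample rows are the ID A rows from the question, and the names `changed`, `run`, `span`, `size` and `is_last` are illustrative, not part of the original answer.

```r
# Sample rows for ID "A", copied from the question
df <- data.frame(
  ID    = rep("A", 7),
  GROUP = c("GR1", "GR1", "GR1", "GR2", "GR2", "GR1", "GR1"),
  DATE  = as.Date(c("12/01/2013", "09/04/2014", "01/03/2015", "04/04/2015",
                    "08/21/2015", "01/05/2016", "06/28/2016"),
                  format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)
n <- nrow(df)

# TRUE on every row where ID or GROUP differs from the previous row
changed <- c(TRUE, df$ID[-1] != df$ID[-n] | df$GROUP[-1] != df$GROUP[-n])
run <- cumsum(changed)  # run id, the same idea as rleid

# Span of each run in days, and the number of rows in each run
span <- ave(as.numeric(df$DATE), run, FUN = function(d) max(d) - min(d))
size <- ave(rep(1, n), run, FUN = sum)

# Keep the span only on the last row of runs with more than one row
is_last <- c(run[-1] != run[-n], TRUE)
df$Diff <- ifelse(is_last & size > 1, span, NA)
print(df)
```

For these rows, Diff is 398, 139 and 175 on the last row of each run and NA elsewhere, matching the table in the question.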
Answer 1: (score: 5)
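Doing this in SAS is trivial. Use RETAIN to carry forward the start date from the first record of each group. Your data is not shown sorted, so either sort it first or, if you want to keep the current order (and the records within each group are already sorted by date), use the NOTSORTED option on the BY statement.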
data want ;
set have ;
by id group notsorted;
if first.group then start = date ;
else if last.group then diff = date - start ;
retain start;
drop start;
run;
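For readers following along in R, the same single-pass idea can be sketched with a loop that carries `start` from row to row, much as RETAIN does. This is only an illustration, assuming the rows are in the desired order and dates are sorted within each run. The sample rows are the ID B rows from the question, and `is_first` and `is_last` are hypothetical stand-ins for first.group and last.group.

```r
# Sample rows for ID "B", copied from the question
have <- data.frame(
  ID    = rep("B", 7),
  GROUP = c("GR2", "GR2", "GR2", "GR3", "GR2", "GR2", "GR1"),
  DATE  = as.Date(c("11/01/2013", "06/04/2014", "04/15/2015", "11/04/2015",
                    "03/21/2016", "07/05/2016", "06/28/2016"),
                  format = "%m/%d/%Y"),
  stringsAsFactors = FALSE
)
n <- nrow(have)
diff_days <- rep(NA_real_, n)
start <- NA  # carried across iterations, like a RETAINed variable

for (i in seq_len(n)) {
  # A row is first/last in its run if a neighbour has a different ID or GROUP
  is_first <- i == 1 || have$ID[i] != have$ID[i - 1] ||
    have$GROUP[i] != have$GROUP[i - 1]
  is_last <- i == n || have$ID[i] != have$ID[i + 1] ||
    have$GROUP[i] != have$GROUP[i + 1]
  if (is_first) {
    start <- have$DATE[i]
  } else if (is_last) {
    diff_days[i] <- as.numeric(have$DATE[i] - start)
  }
}
have$Diff <- diff_days
print(have)
```

As in the SAS step, a run with a single row is both first and last, so it takes the first branch and its Diff stays missing.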
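If you need to keep the current order but the dates are not sorted within each group, then to find the minimum and maximum dates within the group you need to add another variable and a bit more logic.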
data want ;
set have ;
by id group notsorted;
if first.group then start = date ;
if first.group then stop = date ;
start = min(start,date);
stop = max(stop,date);
if last.group and not first.group then diff = stop - start ;
retain start stop;
drop start stop;
run;
Answer 2: (score: 2)
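data want(drop=_:);
  /* Look-ahead merge: pair each row with the next row's id, group and date */
  merge have have(firstobs=2 rename=(id=_id group=_group date=_date));
  retain _temp;
  /* _temp tracks the earliest date seen so far in the current run */
  _temp= min(_temp,date);
  /* The next row belongs to a different run, so this row closes its run */
  if id^=_id or group^=_group then do;
    diff=intck('day',_temp,date);
    /* A zero difference (for example a single-row run) is left missing */
    if diff=0 then call missing(diff);
    /* Seed the next run with its first date */
    _temp=_date;
  end;
run;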