dplyr在N天之内的组数据(id)

时间:2018-07-21 08:09:02

标签: r dplyr

我想在N天内按ID对数据进行分组。

library(tidyverse)
library(lubridate)

df<- tribble(
~id,~date,
"A","2014-07-19",
"A","2014-07-20",
"A","2014-07-21",
"B","2014-10-01",
"B","2014-10-03",
"B","2014-10-08",
"B","2014-10-10",
"B","2014-10-13",
"B","2014-10-17",
"B","2014-11-02",
"B","2014-11-04",
"B","2014-11-06",
"C","2013-03-27",
"C","2013-03-28",
"C","2013-04-01",
"C","2013-04-03",
"C","2013-04-05",
"C","2013-04-07",
"C","2013-05-27",
"C","2013-05-29",
"D","2015-01-09",
"D","2015-01-12",
"D","2015-01-14",
"D","2015-01-16"
) %>% mutate_at(vars(date),funs(ymd(.)))

所需的输出看起来像这样

id  first_date  last_date    Count  
A   2014-07-19  2014-07-21   3      
B   2014-10-01  2014-10-03   2      
B   2014-10-08  2014-10-13   3      
B   2014-10-17  2014-10-17   1      
B   2014-11-02  2014-11-06   3      
C   2013-03-27  2013-03-28   2      
C   2013-04-01  2013-04-07   4      
C   2013-05-27  2013-05-29   2      
D   2015-01-09  2015-01-16   4

我的解决方法是:

df %>% group_by(id) %>% 
mutate(diff=as.numeric(date-lag(date,default=first(date)))) %>% 
mutate(diff=if_else(diff>3,0,diff)) %>% 
mutate(rank=min_rank(cumsum(diff=='0'))) %>% 
group_by(id,rank) %>% 
summarise(first_date=min(date), last_date=max(date), Count=length(id)) %>% 
data.frame()

我的解决方案可以工作,但我想知道是否有任何更简单/优雅的方法来存档此文件?例如删除diff和rank之类的临时列。

1 个答案:

答案 0 :(得分:2)

我认为您可以在线进行group_by计算:

df %>%
    group_by(id, counter=cumsum(c(FALSE,diff(date)>3))) %>%
    summarise(first_date = first(date), last_date = last(date), count = n()) %>%
    select(-counter)

## A tibble: 9 x 4
## Groups:   id [4]
#  id    first_date last_date  count
#  <chr> <date>     <date>     <int>
#1 A     2014-07-19 2014-07-21     3
#2 B     2014-10-01 2014-10-03     2
#3 B     2014-10-08 2014-10-13     3
#4 B     2014-10-17 2014-10-17     1
#5 B     2014-11-02 2014-11-06     3
#6 C     2013-03-27 2013-03-28     2
#7 C     2013-04-01 2013-04-07     4
#8 C     2013-05-27 2013-05-29     2
#9 D     2015-01-09 2015-01-16     4