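I want to group my data by ID, keeping rows in the same group when each date falls within N days of the previous one (N = 3 in the example below).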
library(tidyverse)
library(lubridate)
df<- tribble(
~id,~date,
"A","2014-07-19",
"A","2014-07-20",
"A","2014-07-21",
"B","2014-10-01",
"B","2014-10-03",
"B","2014-10-08",
"B","2014-10-10",
"B","2014-10-13",
"B","2014-10-17",
"B","2014-11-02",
"B","2014-11-04",
"B","2014-11-06",
"C","2013-03-27",
"C","2013-03-28",
"C","2013-04-01",
"C","2013-04-03",
"C","2013-04-05",
"C","2013-04-07",
"C","2013-05-27",
"C","2013-05-29",
"D","2015-01-09",
"D","2015-01-12",
"D","2015-01-14",
"D","2015-01-16"
) %>% mutate(date = ymd(date))  # parse the character column as Date; funs() is deprecated
The desired output looks like this:
id first_date last_date  Count
A  2014-07-19 2014-07-21 3
B  2014-10-01 2014-10-03 2
B  2014-10-08 2014-10-13 3
B  2014-10-17 2014-10-17 1
B  2014-11-02 2014-11-06 3
C  2013-03-27 2013-03-28 2
C  2013-04-01 2013-04-07 4
C  2013-05-27 2013-05-29 2
D  2015-01-09 2015-01-16 4
My solution is:
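df %>% group_by(id) %>%
  # days since the previous row within each id (0 for the first row)
  mutate(diff = as.numeric(date - lag(date, default = first(date)))) %>%
  # mark gaps longer than 3 days with 0 so they start a new group
  mutate(diff = if_else(diff > 3, 0, diff)) %>%
  mutate(rank = min_rank(cumsum(diff == 0))) %>%
  group_by(id, rank) %>%
  summarise(first_date = min(date), last_date = max(date), Count = length(id)) %>%
  data.frame()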
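My solution works, but I am wondering whether there is a simpler or more elegant way to achieve this, for example one that avoids temporary columns such as diff and rank.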
Answer 0 (score: 2)
I think you can do the calculation inline, inside group_by:
df %>%
group_by(id, counter=cumsum(c(FALSE,diff(date)>3))) %>%
summarise(first_date = first(date), last_date = last(date), count = n()) %>%
select(-counter)
## A tibble: 9 x 4
## Groups: id [4]
# id first_date last_date count
# <chr> <date> <date> <int>
#1 A 2014-07-19 2014-07-21 3
#2 B 2014-10-01 2014-10-03 2
#3 B 2014-10-08 2014-10-13 3
#4 B 2014-10-17 2014-10-17 1
#5 B 2014-11-02 2014-11-06 3
#6 C 2013-03-27 2013-03-28 2
#7 C 2013-04-01 2013-04-07 4
#8 C 2013-05-27 2013-05-29 2
#9 D 2015-01-09 2015-01-16 4
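To see why the `cumsum(c(FALSE, diff(date) > 3))` expression produces the group labels, here is a minimal base R sketch that applies it to the dates of id "B" alone. It assumes the dates are already sorted in ascending order, as they are in the question's data; `max_gap`, `gaps`, `new_run`, and `counter` are names chosen here for illustration only.

```r
# Dates for id "B" from the question, already sorted ascending.
dates <- as.Date(c("2014-10-01", "2014-10-03", "2014-10-08", "2014-10-10",
                   "2014-10-13", "2014-10-17", "2014-11-02", "2014-11-04",
                   "2014-11-06"))

max_gap <- 3  # hypothetical name for the N-day threshold

# Gap in days between each date and the one before it (length n - 1).
gaps <- as.numeric(diff(dates))
# gaps: 2 5 2 3 4 16 2 2

# TRUE wherever a new run starts; the first row never starts a new run,
# which is why FALSE is prepended.
new_run <- c(FALSE, gaps > max_gap)

# The running count of run starts gives one label per row.
counter <- cumsum(new_run)
# counter: 0 0 1 1 1 2 3 3 3

# Rows per run, matching the Count column for "B" in the desired output.
run_sizes <- as.vector(table(counter))
# run_sizes: 2 3 1 3
```

Because `diff()` only compares neighbouring rows, the labels are meaningful only when each id's dates are in order; if that is not guaranteed, sorting first (for example with `arrange(id, date)`) would be a reasonable precaution.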