I have a dataset that looks like this:
ID FromDate ToDate SiteID Cost
1 8/12/2014 8/31/2014 12 245.98
1 9/1/2014 9/7/2014 12 269.35
1 10/10/2014 10/17/2014 12 209.98
1 11/22/2014 11/30/2014 12 309.12
1 12/1/2014 12/11/2014 12 202.14
2 8/16/2014 8/21/2014 12 109.35
2 8/22/2014 8/24/2014 14 44.12
2 9/25/2014 9/29/2014 12 98.75
3 9/15/2014 9/30/2014 23 536.27
3 10/1/2014 10/31/2014 12 529.87
3 11/1/2014 11/30/2014 12 969.55
3 12/1/2014 12/12/2014 12 607.35
I would like it to look like this:
ID FromDate ToDate SiteID Cost
1 8/12/2014 9/7/2014 12 515.33
1 10/10/2014 10/17/2014 12 209.98
1 11/22/2014 12/11/2014 12 511.26
2 8/16/2014 8/21/2014 12 109.35
2 8/22/2014 8/24/2014 14 44.12
2 9/25/2014 9/29/2014 12 98.75
3 9/15/2014 9/30/2014 23 536.27
3 10/1/2014 12/12/2014 12 2106.77
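As you can see, when one date range continues directly into the next, the ranges are merged and the cost is summed by ID and SiteID. To make the rules explicit: if the date ranges are continuous but the SiteID changes, the row stays separate. If the date ranges are not continuous, the row also stays separate. How can I do this in R? I also have more than 100,000 individual IDs, so which method or package would be the most efficient?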
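For anyone who wants to reproduce this, the sample data above could be built roughly as follows. This is only a sketch: the object name `df` is an assumption chosen to match the answers, and the dates are assumed to be month/day/year strings that need converting to `Date`.

```r
# Sample data from the question; `df` is an illustrative name, not a requirement.
df <- data.frame(
  ID       = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  FromDate = c("8/12/2014", "9/1/2014", "10/10/2014", "11/22/2014", "12/1/2014",
               "8/16/2014", "8/22/2014", "9/25/2014",
               "9/15/2014", "10/1/2014", "11/1/2014", "12/1/2014"),
  ToDate   = c("8/31/2014", "9/7/2014", "10/17/2014", "11/30/2014", "12/11/2014",
               "8/21/2014", "8/24/2014", "9/29/2014",
               "9/30/2014", "10/31/2014", "11/30/2014", "12/12/2014"),
  SiteID   = c(12, 12, 12, 12, 12, 12, 14, 12, 23, 12, 12, 12),
  Cost     = c(245.98, 269.35, 209.98, 309.12, 202.14,
               109.35, 44.12, 98.75,
               536.27, 529.87, 969.55, 607.35),
  stringsAsFactors = FALSE
)

# Convert the month/day/year strings to Date so that date arithmetic works.
df$FromDate <- as.Date(df$FromDate, format = "%m/%d/%Y")
df$ToDate   <- as.Date(df$ToDate,   format = "%m/%d/%Y")
```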
Answer 0 (score: 6)
This might do it:
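library(dplyr)

# Assumes FromDate and ToDate are Date columns and rows are sorted by ID and FromDate.
df %>%
  # gr increases whenever a row does not start exactly one day after the previous row ends
  mutate(gr = cumsum(FromDate - lag(ToDate, default = 1) != 1)) %>%
  group_by(gr, ID, SiteID) %>%
  summarise(FromDate = min(FromDate),
            ToDate = max(ToDate),
            cost = sum(Cost))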
gr ID SiteID FromDate ToDate cost
(int) (int) (int) (date) (date) (dbl)
1 1 1 12 2014-08-12 2014-09-07 515.33
2 2 1 12 2014-10-10 2014-10-17 209.98
3 3 1 12 2014-11-22 2014-12-11 511.26
4 4 2 12 2014-08-16 2014-08-21 109.35
5 4 2 14 2014-08-22 2014-08-24 44.12
6 5 2 12 2014-09-25 2014-09-29 98.75
7 6 3 23 2014-09-15 2014-09-30 536.27
8 6 3 12 2014-10-01 2014-12-12 2106.77
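Or with data.table:
library(data.table)
setDT(df)
# Flag a new group wherever FromDate is not one day after the previous ToDate, then aggregate.
df[, gr := cumsum(FromDate - shift(ToDate, fill = 1) != 1)
   ][, .(FromDate = min(FromDate), ToDate = max(ToDate), cost = sum(Cost)), by = .(gr, ID, SiteID)]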
gr ID SiteID FromDate ToDate cost
1: 1 1 12 2014-08-12 2014-09-07 515.33
2: 2 1 12 2014-10-10 2014-10-17 209.98
3: 3 1 12 2014-11-22 2014-12-11 511.26
4: 4 2 12 2014-08-16 2014-08-21 109.35
5: 4 2 14 2014-08-22 2014-08-24 44.12
6: 5 2 12 2014-09-25 2014-09-29 98.75
7: 6 3 23 2014-09-15 2014-09-30 536.27
8: 6 3 12 2014-10-01 2014-12-12 2106.77
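The grouping idea in this answer can also be spelled out step by step in base R. The following is a minimal sketch rather than the answer's own method: it uses a toy subset of the question's data, and the names `d`, `gap`, `new_run`, `run` and `res` are made up for illustration. Unlike the code above, it folds the ID and SiteID checks into the run flag itself instead of grouping by them afterwards.

```r
# Toy subset of the question's data (dates already in ISO format).
d <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2),
  FromDate = as.Date(c("2014-08-12", "2014-09-01", "2014-10-10",
                       "2014-08-16", "2014-08-22", "2014-09-25")),
  ToDate   = as.Date(c("2014-08-31", "2014-09-07", "2014-10-17",
                       "2014-08-21", "2014-08-24", "2014-09-29")),
  SiteID   = c(12, 12, 12, 12, 14, 12),
  Cost     = c(245.98, 269.35, 209.98, 109.35, 44.12, 98.75)
)
d <- d[order(d$ID, d$FromDate), ]
n <- nrow(d)

# Gap in days between each row's FromDate and the previous row's ToDate.
gap <- c(NA, as.numeric(d$FromDate[-1] - d$ToDate[-n]))

# A row starts a new run if it is the first row, if the gap is not exactly
# one day, or if the ID or SiteID differs from the previous row.
new_run <- c(TRUE, gap[-1] != 1 |
                   d$ID[-1] != d$ID[-n] |
                   d$SiteID[-1] != d$SiteID[-n])

# The running count of run starts gives every row a run identifier.
d$run <- cumsum(new_run)

# Collapse each run to a single row.
res <- do.call(rbind, lapply(split(d, d$run), function(g) {
  data.frame(ID = g$ID[1], FromDate = min(g$FromDate), ToDate = max(g$ToDate),
             SiteID = g$SiteID[1], Cost = sum(g$Cost))
}))
rownames(res) <- NULL
res
```

One difference that may matter on real data: if an ID moves from site 12 to site 14 and back to site 12 on consecutive days, grouping by (gr, ID, SiteID) would merge the two site 12 stays, whereas flagging the SiteID change directly keeps them as separate rows. For 100,000+ IDs the `split`/`lapply` step here is likely to be slow, so it is meant to show the logic, not to replace the dplyr or data.table versions.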
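Answer 1 (score: 2)
Here is one way with dplyr and tidyr. There is probably room to clean it up, but the premise is to create a new group indicator. Someone with better data.table skills could probably come up with something really slick.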
library(dplyr)
library(tidyr)
df$FromDate <- lubridate::mdy(df$FromDate)
df$ToDate <- lubridate::mdy(df$ToDate)
gather(df, Date, Val, -c(ID, SiteID, Cost)) %>%
arrange(ID, SiteID, Val, Date) %>%
group_by(ID, SiteID) %>%
mutate(lagDateDiff = as.integer(Val - lag(Val)),
indicator = ifelse(Date == "ToDate" | is.na(lagDateDiff), 0,
ifelse((Date == "FromDate" & lagDateDiff == 1), 0, 1)),
newGroup = cumsum(indicator)) %>% # Run to here to see intermediate result
select(-lagDateDiff, -indicator) %>%
spread(Date, Val) %>%
group_by(ID, SiteID, newGroup) %>%
summarise(Min_From_Date = min(FromDate),
Max_To_Date = max(ToDate),
Sum_Cost = sum(Cost))
# ID SiteID newGroup Min_From_Date Max_To_Date Sum_Cost
# (int) (int) (dbl) (date) (date) (dbl)
# 1 1 12 0 2014-08-12 2014-09-07 515.33
# 2 1 12 1 2014-10-10 2014-10-17 209.98
# 3 1 12 2 2014-11-22 2014-12-11 511.26
# 4 2 12 0 2014-08-16 2014-08-21 109.35
# 5 2 12 1 2014-09-25 2014-09-29 98.75
# 6 2 14 0 2014-08-22 2014-08-24 44.12
# 7 3 12 0 2014-10-01 2014-12-12 2106.77
# 8 3 23 0 2014-09-15 2014-09-30 536.27