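I have a data frame that I want to process per group with group_by, but each row has to be compared against all the other dates in its group in order to build a continuous timeline (route) with no gaps or overlaps, so that I end up with only the maximum end date and the minimum start date for each ID. For example: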
ID <- c(1,1,1,3,3,7,7,7,22,22,32,32,173,173,213,213,230,330,330,330,330,150579)
EndDate <- c("9999-12-31","2018-04-30","2015-07-31","9999-12-31","2008-07-26","9999-12-31","9999-12-31","2011-08-31","9999-12-31","2006-11-30","9999-12-31","2007-06-30","9999-12-31","2010-09-30","9999-12-31","2013-04-30","9999-12-31","9999-12-31","2016-12-31","2016-09-30","2015-08-31","9999-12-31")
BegDate <- c("2015-08-01","2017-10-23","1983-12-05","2015-11-12","2003-02-24","2017-04-01","2014-07-15","1991-11-04","2006-12-01","1979-08-01","2007-07-01","1979-08-01","2010-10-01","1987-04-01","1980-10-20","2008-05-01","1983-02-14","1982-01-01","2016-10-01","2015-09-01","2014-02-01","1982-09-01")
df_dates <- data.frame(ID,EndDate,BegDate)
ID EndDate BegDate
1 9999-12-31 2015-08-01
1 2018-04-30 2017-10-23
1 2015-07-31 1983-12-05
3 9999-12-31 2015-11-12
3 9999-12-31 2015-11-12
7 9999-12-31 2017-04-01
7 9999-12-31 2014-07-15
7 2011-08-31 1991-11-04
22 9999-12-31 2006-12-01
22 2006-11-30 1979-08-01
32 9999-12-31 2007-07-01
32 2007-06-30 1979-08-01
173 9999-12-31 2010-10-01
173 2010-09-30 1987-04-01
213 9999-12-31 1980-10-20
213 2013-04-30 2008-05-01
233 9999-12-31 2016-06-01
233 2016-05-31 1998-10-01
330 9999-12-31 1982-01-01
330 2016-12-31 2016-10-01
330 2016-09-30 2015-09-01
330 2015-08-31 2014-02-01
150579 9999-12-31 1982-09-01
I have tried dplyr, but I don't know how to compare across all the elements of a group. I used for loops, but the data frame is large and speed is a must.
library(lubridate)  # provides days()

v_result <- c()
for (i in unique(df_dates$ID)) {
  df_temp <- df_dates[df_dates$ID == i, ]
  df_temp$EndDate <- as.Date(df_temp$EndDate, "%Y%m%d")
  df_temp$BegDate <- as.Date(df_temp$BegDate, "%Y%m%d")
  v_row <- seq_len(nrow(df_temp))
  for (j in v_row) {
    elm <- v_row[!v_row %in% j]  # every other row in the same group
    findNext <- FALSE
    for (h in elm) {
      # TRUE if row j contains row h, or row j starts the day after row h ends
      if ((df_temp$EndDate[j] >= df_temp$EndDate[h] &
           df_temp$BegDate[j] <= df_temp$BegDate[h]) |
          df_temp$BegDate[j] - days(1) == df_temp$EndDate[h]) {
        findNext <- TRUE
      }
    }
    v_result <- c(v_result, findNext)
  }
}
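As you can see, this needs several nested for loops, which is far from ideal, and I don't understand the apply family of functions very well. On top of that, there may be more than 150k IDs, so this is not a viable option. My idea is to flag the rows that overlap or that represent a gap as FALSE and filter them out, which would let me take the maximum and minimum: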
df_final <- df_final%>%
group_by(ID)%>%
mutate(
Biggest = max(EndDate),
Lowest = min(BegDate)
)
to produce something like this:
ID EndDate BegDate
1 9999-12-31 1983-12-05
3 9999-12-31 2015-11-12
7 9999-12-31 2014-07-15
22 9999-12-31 1979-08-01
32 9999-12-31 1979-08-01
173 9999-12-31 2017-07-01
213 9999-12-31 1980-10-20
233 9999-12-31 1998-10-01
330 9999-12-31 1982-01-01
150579 9999-12-31 1982-09-01
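The EndDate does not always have to be 9999-12-31, as long as it is the maximum date for the ID, that is, the end of a period with no gaps and with overlaps ignored. I have been struggling with this for several days without making any progress.
Is there a way to do this with dplyr that is efficient on large data frames?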
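Answer 0: (score: 1)
The logic behind your final output is not entirely clear. As an example, let's do something with data.table (which should be efficient for larger data frames) and magrittr (for readability):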
library(data.table)
library(magrittr)
# cummax() does not accept Date objects, so work on the underlying
# numbers and restore the class afterwards
calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  # latest EndDate among the earlier rows of the same ID
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  # the first row of each ID has no earlier row, so borrow its own EndDate
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  # gap_between = 1 when a row starts more than one day after max_until_now
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>%
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>%
  # number the gap-free blocks within each ID, then collapse each block
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>%
  #.[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]
The output here is:
ID BegDate EndDate
1: 1 1983-12-05 9999-12-31
2: 3 2003-02-24 2008-07-26
3: 3 2015-11-12 9999-12-31
4: 7 1991-11-04 2011-08-31
5: 7 2014-07-15 9999-12-31
6: 22 1979-08-01 9999-12-31
7: 32 1979-08-01 9999-12-31
8: 173 1987-04-01 9999-12-31
9: 213 1980-10-20 9999-12-31
10: 230 1983-02-14 9999-12-31
11: 330 1982-01-01 9999-12-31
12: 150579 1982-09-01 9999-12-31
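If you look at rows 2 and 4, you will see that, according to your expected output, they should not be there.
However, there is a gap between the periods for those IDs, so we cannot simply take the lowest BegDate; we need this intermediate step before producing the final output.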
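To illustrate the gap check on a single ID, here is a minimal base R sketch using the two periods of ID 3 from the question. The variable names (running_max, prev_max, starts_new_block, block_id) are illustrative assumptions and are not part of the pipeline above; the idea is only that a row opens a new block when it starts more than one day after the latest end date seen in earlier rows.

```r
# Two periods for one ID (values taken from ID 3 in the question).
beg <- as.Date(c("2015-11-12", "2003-02-24"))
end <- as.Date(c("9999-12-31", "2008-07-26"))

# Sort by start date so that "earlier rows" is well defined.
ord <- order(beg)
beg <- beg[ord]
end <- end[ord]

# Latest end date seen so far, computed on the underlying day numbers.
running_max <- cummax(as.numeric(end))

# Shift by one row: latest end date among all earlier rows (NA for the first).
prev_max <- c(NA, head(running_max, -1))

# A row starts a new block when it begins more than one day after prev_max.
starts_new_block <- !is.na(prev_max) & as.numeric(beg) > prev_max + 1

# Running count of gaps gives a block number per row.
block_id <- cumsum(starts_new_block)
```

Here the second period starts years after the first one ends, so the two rows end up in different blocks, which is why ID 3 appears twice in the output above.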
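For the final output, one can assume that you want to discard everything before a gap occurs (i.e., keep only the last record per group). You can do this by simply uncommenting the line before the last one, i.e.: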
library(data.table)
library(magrittr)
calc_cummax <- function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
df_final <- setDT(df_dates) %>%
  .[, `:=` (BegDate = as.Date(as.character(BegDate), "%Y-%m-%d"),
            EndDate = as.Date(as.character(EndDate), "%Y-%m-%d"))] %>%
  .[order(ID, BegDate),] %>%
  .[, max_until_now := shift(calc_cummax(EndDate)), by = ID] %>%
  .[, lead_max := shift(max_until_now, type = "lead"), by = ID] %>%
  .[is.na(max_until_now), max_until_now := lead_max, by = ID] %>%
  .[(max_until_now + 1L) >= BegDate, gap_between := 0, by = ID] %>%
  .[(max_until_now + 1L) < BegDate, gap_between := 1, by = ID] %>%
  .[is.na(gap_between), gap_between := 0] %>%
  .[, ("fakeidx") := cumsum(gap_between), by = ID] %>%
  .[, .(BegDate = min(BegDate), EndDate = max(EndDate)), by = .(ID, fakeidx)] %>%
  .[, .SD[.N], by = ID] %>%
  .[, ("fakeidx") := NULL]
which produces:
ID BegDate EndDate
1: 1 1983-12-05 9999-12-31
2: 3 2015-11-12 9999-12-31
3: 7 2014-07-15 9999-12-31
4: 22 1979-08-01 9999-12-31
5: 32 1979-08-01 9999-12-31
6: 173 1987-04-01 9999-12-31
7: 213 1980-10-20 9999-12-31
8: 230 1983-02-14 9999-12-31
9: 330 1982-01-01 9999-12-31
10: 150579 1982-09-01 9999-12-31
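Answer 1: (score: 0)
Starting from the code you used above to create the data frame df_dates, the following code produces the table at the bottom of your question: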
library(dplyr)

df_dates <- data.frame(ID,EndDate,BegDate)
df_dates %>%
  mutate(EndDate = as.Date(EndDate, "%Y-%m-%d"),     #Your as.Date calls above didn't include
         BegDate = as.Date(BegDate, "%Y-%m-%d")) %>% #the '-' character between values
  group_by(ID) %>%
  summarise(                  #using 'summarise' produces one row per
    Biggest = max(EndDate),   #grouped 'ID'. 'mutate' keeps all rows.
    Lowest = min(BegDate)
  )
Hopefully this shows how dplyr can get you the expected result.
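If gaps within an ID have to be respected, a plain min/max per ID is not enough, because it spans the gap. The following is a hedged dplyr sketch of one way to handle that, not a definitive solution: it assumes the dates are already of class Date, that two periods are contiguous when the later one starts at most one day after the latest earlier end date, and that only the last gap-free block per ID should be kept. The helper columns prev_max and block are hypothetical names, and the toy data reuses IDs 3 and 22 from the question.

```r
library(dplyr)

# Toy data: ID 3 has a gap between its periods, ID 22 does not.
df <- data.frame(
  ID      = c(3, 3, 22, 22),
  BegDate = as.Date(c("2015-11-12", "2003-02-24", "2006-12-01", "1979-08-01")),
  EndDate = as.Date(c("9999-12-31", "2008-07-26", "9999-12-31", "2006-11-30"))
)

result <- df %>%
  arrange(ID, BegDate) %>%
  group_by(ID) %>%
  mutate(
    # latest end date among earlier rows of the same ID (NA for the first row)
    prev_max = lag(cummax(as.numeric(EndDate))),
    # block number increases each time a row starts after a gap
    block = cumsum(!is.na(prev_max) & as.numeric(BegDate) > prev_max + 1)
  ) %>%
  group_by(ID, block) %>%
  summarise(BegDate = min(BegDate), EndDate = max(EndDate), .groups = "drop") %>%
  group_by(ID) %>%
  slice_tail(n = 1) %>%   # keep only the last gap-free block per ID
  ungroup() %>%
  select(ID, BegDate, EndDate)
```

With this toy data, ID 3 keeps only its later period, while the two adjacent periods of ID 22 are merged into one.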
Answer 2: (score: 0)
Without a filter:
library(dplyr)
library(lubridate)  # provides ymd()

df_dates %>%
  unique(by="ID") %>%
  mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>%
  group_by(ID) %>%
  summarize(max(EndDate), min(BegDate))
With a filter before group_by (as in your for loop). Note the use of the lead function to compare a date with the date in the next row.
df_dates %>%
  unique(by="ID") %>%
  mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>%
  filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) |
           BegDate-1 == lead(EndDate)) %>%
  group_by(ID) %>%
  summarize(max(EndDate), min(BegDate))
With the filter after group_by (since the intent of your example is not entirely clear):
df_dates %>%
  unique(by="ID") %>%
  mutate(EndDate=ymd(EndDate), BegDate=ymd(BegDate)) %>%
  group_by(ID) %>%
  filter(EndDate >= lead(EndDate) & BegDate <= lead(BegDate) |
           BegDate-1 == lead(EndDate)) %>%
  summarize(max(EndDate), min(BegDate))
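To make the lead() comparison concrete, here is a small sketch on the three EndDate values of ID 1 from the question. The names end_dates and keep are illustrative assumptions. The last element has no following row, so lead() returns NA there, and filter() drops rows whose condition evaluates to NA.

```r
library(dplyr)

# EndDate values of ID 1, in the order they appear in the question.
end_dates <- as.Date(c("9999-12-31", "2018-04-30", "2015-07-31"))

# lead() shifts the vector up by one position and pads the end with NA.
next_end <- lead(end_dates)

# Compare each date with the date in the next row.
keep <- end_dates >= next_end
```

Because the comparison for the final row is NA, a filter built this way always removes the last row of the data (or of each group, when it runs after group_by), which is worth keeping in mind when reading the results.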