How can I split a dataset in R without duplicating data?

Time: 2017-10-15 23:35:15

Tags: r

I want to segment grouped data by a set of overlapping date ranges. This is what I have managed so far:

library(dplyr)

## Create data frames
df_A = data.frame( "ID" = rep("A" , 5) , "Date" = c( "2000-01-03" , "2000-02-03" , "2000-04-01" , "2000-05-03" ,"2000-05-04" ) , "Var_1"=c(1,2,3,4,5) ) 

df_B = data.frame( "ID" = rep("B" , 5) , "Date" = c( "2000-01-03" , "2000-01-04" , "2000-01-05" , "2000-03-02" ,"2000-04-01" ) , "Var_1"=c(6,7,8,9,10) )

df_C = data.frame( "ID" = rep("C" , 5) , "Date" = c( "2000-01-03" , "2000-02-03" , "2000-03-01" , "2000-04-03" ,"2000-05-04" ) , "Var_1"=c(11,12,13,14,15) )

## Bind and group data frames together via ID
mydf = bind_rows( df_A , df_B , df_C ) %>% group_by( ID )

## Create date range
filterDates = data.frame( "start" = c("2000-01-01" , "2000-02-01","2000-03-01","2000-04-01" ) , "end" = c( "2000-02-29","2000-03-31","2000-04-30","2000-05-31" ) )

## Segment data according to date range
segmented_df = apply( filterDates , 1 , function(x) filter( mydf , Date >= as.Date(x["start"]) & Date <= as.Date(x["end"]) ) )

However, this process creates duplicated data across some of the list elements.

## For example:
segmented_df[[2]][1,] ## This was already in segmented_df[[1]][2,]

How can I do this while avoiding duplicated data?

I thought about using group_by(ID, Date), but that does not take the date ranges into account.

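Note: I am not looking for a particular form of solution, but one that is memory-efficient and makes it "easy" to call each complete segmented group would be preferable.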

I apologize in advance if I have used any incorrect terminology.

1 answer:

Answer 0: (score: 0)

You could try using unique instead of group_by:

 ## unique() takes no column name; it drops rows that are identical in every column
 mydf = bind_rows( df_A , df_B , df_C ) %>% unique()
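
Note that unique() only removes rows that are already identical in the bound data. The repeats in the question come from the date ranges overlapping, so a row that falls inside two ranges is still returned twice. One possible way around this, as a minimal sketch only, is to label each row with the first range that contains it and then split on that label. This assumes it is acceptable for a row in an overlap to belong to the earlier range only. The helper first_range and the column segment are hypothetical names introduced for illustration, and the data is rebuilt in base R with Date stored as a real Date column.

```r
# Rebuild the question's data, storing Date as class Date (an assumption;
# the question keeps it as text).
mydf <- data.frame(
  ID    = rep(c("A", "B", "C"), each = 5),
  Date  = as.Date(c(
    "2000-01-03", "2000-02-03", "2000-04-01", "2000-05-03", "2000-05-04",
    "2000-01-03", "2000-01-04", "2000-01-05", "2000-03-02", "2000-04-01",
    "2000-01-03", "2000-02-03", "2000-03-01", "2000-04-03", "2000-05-04"
  )),
  Var_1 = 1:15,
  stringsAsFactors = FALSE
)

filterDates <- data.frame(
  start = as.Date(c("2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01")),
  end   = as.Date(c("2000-02-29", "2000-03-31", "2000-04-30", "2000-05-31"))
)

# Hypothetical helper: index of the first range containing date d,
# or NA when no range contains it.
first_range <- function(d, ranges) {
  hit <- which(d >= ranges$start & d <= ranges$end)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# One extra integer column instead of overlapping copies of the rows.
mydf$segment <- vapply(
  seq_len(nrow(mydf)),
  function(i) first_range(mydf$Date[i], filterDates),
  integer(1)
)

# Each row now appears in exactly one list element
# (rows matching no range would be dropped by split()).
segmented_df <- split(mydf, mydf$segment)
```

Each segment can then be called by its range index, for example segmented_df[["2"]]. If per-ID pieces are needed, split(mydf, list(mydf$ID, mydf$segment), drop = TRUE) is one option.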