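R: Generating a sub data frame using date range groupings

Time: 2014-09-06 09:42:52

Tags: r

I have a data frame with two columns, an identifier and a date. The code below creates a sample data frame.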

x <- c(rep(c("a","b"), each=10), rep(c("c", "d"), each=5))
y <- c(seq(as.Date("2014-01-01"), as.Date("2014-01-05"), by = 1), 
    as.Date("2014-03-12"), 
    as.Date("2014-03-15"),
    seq(as.Date("2014-05-11"), as.Date("2014-05-13"), by = 1),
    seq(as.Date("2014-06-11"), as.Date("2014-06-14"), by = 1),
    seq(as.Date("2014-06-01"), as.Date("2014-06-20"), by = 2),
    seq(as.Date("2014-07-31"), as.Date("2014-08-05"), by = 1))  

df <- data.frame(x = x, y = y)  

Here is the output of df.

      x          y
  1   a 2014-01-01
  2   a 2014-01-02
  3   a 2014-01-03
  4   a 2014-01-04
  5   a 2014-01-05
  6   a 2014-03-12
  7   a 2014-03-15
  8   a 2014-05-11
  .
  .
  .
  23  c 2014-06-17
  24  c 2014-06-19
  25  c 2014-07-31
  26  d 2014-08-01
  27  d 2014-08-02
  28  d 2014-08-03
  29  d 2014-08-04
  30  d 2014-08-05

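I would like to create another data frame that summarizes the date ranges; that is, for each x, one entry is created for every set of consecutive dates. My desired output (based on the data in df) is as follows:

  x start.rng  end.rng    days.rng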
  a 2014-01-01 2014-01-05 5
  a 2014-03-12 2014-03-12 1
  a 2014-03-15 2014-03-15 1
  a 2014-05-11 2014-05-13 3
  b 2014-06-11 2014-06-14 4
  b 2014-06-01 2014-06-01 1
  b 2014-06-03 2014-06-03 1
  b 2014-06-05 2014-06-05 1
  b 2014-06-07 2014-06-07 1
  b 2014-06-09 2014-06-09 1
  b 2014-06-11 2014-06-11 1
  c 2014-06-13 2014-06-13 1
  c 2014-06-15 2014-06-15 1
  c 2014-06-17 2014-06-17 1
  c 2014-06-19 2014-06-19 1
  c 2014-07-31 2014-07-31 1
  d 2014-08-01 2014-08-05 5

I can't figure out how to approach this.

Thanks

1 Answer:

Answer 0: (score: 1)

Try

res <- do.call(rbind,
    lapply(split(df, df$x), function(.df)
        do.call(rbind,
            lapply(split(.df, cumsum(c(TRUE, diff(.df$y) != 1))), function(.x)
                data.frame(x = .x[1, 1],
                           start.rng = .x[1, 2],
                           end.rng = .x[nrow(.x), 2],
                           days.rng = nrow(.x))))))

row.names(res) <- 1:nrow(res)
head(res)
#  x  start.rng    end.rng days.rng
#1 a 2014-01-01 2014-01-05        5
#2 a 2014-03-12 2014-03-12        1
#3 a 2014-03-15 2014-03-15        1
#4 a 2014-05-11 2014-05-13        3
#5 b 2014-06-11 2014-06-14        4
#6 b 2014-06-01 2014-06-01        1
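
The grouping key in the inner split() is the part that does the work. As a minimal sketch on a made-up date vector (the names y, starts and run_id below are only for illustration), this shows how cumsum(c(TRUE, diff(y) != 1)) assigns one id to each run of consecutive days:

```r
# Toy dates: a 3-day run, a single day, and a 2-day run
y <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-03",
               "2014-03-12",
               "2014-03-15", "2014-03-16"))

# TRUE marks the first date of each run; the first element always
# starts a run, hence the leading TRUE
starts <- c(TRUE, diff(y) != 1)

# The running count of run starts gives a run id for every element
run_id <- cumsum(starts)
run_id
# 1 1 1 2 3 3

# Splitting on the run id yields one chunk per consecutive range
run_lengths <- lengths(split(y, run_id))
run_lengths
```

Each chunk then only needs its first element, its last element and its length, which is what the data.frame() call in the answer extracts.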

Or using data.table

library(data.table)
DT1 <- setDT(df)[, indx := cumsum(c(TRUE, diff(y) != 1)), by = x][
    , list(start.rng = y[1], end.rng = y[.N], days.rng = .N),
    by = list(x, indx)][, indx := NULL]

head(DT1)
 #   x  start.rng    end.rng days.rng
 #1: a 2014-01-01 2014-01-05        5
 #2: a 2014-03-12 2014-03-12        1
 #3: a 2014-03-15 2014-03-15        1
 #4: a 2014-05-11 2014-05-13        3
 #5: b 2014-06-11 2014-06-14        4
 #6: b 2014-06-01 2014-06-01        1
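
One way to sanity-check either result: within a run of consecutive days, the row count should equal the inclusive span between start and end. The sketch below assumes no duplicated dates within a group; res_head is a hypothetical data frame rebuilt by hand from the six rows printed above, so the check runs on its own.

```r
# First six ranges, copied from the head() output shown above
res_head <- data.frame(
  x = c("a", "a", "a", "a", "b", "b"),
  start.rng = as.Date(c("2014-01-01", "2014-03-12", "2014-03-15",
                        "2014-05-11", "2014-06-11", "2014-06-01")),
  end.rng = as.Date(c("2014-01-05", "2014-03-12", "2014-03-15",
                      "2014-05-13", "2014-06-14", "2014-06-01")),
  days.rng = c(5, 1, 1, 3, 4, 1)
)

# Inclusive number of days covered by each range
span <- as.integer(res_head$end.rng - res_head$start.rng) + 1L

# TRUE when every range is a gap-free run of daily dates
ranges_ok <- all(span == res_head$days.rng)
ranges_ok
```

The same two lines can be applied to the full res or DT1 in place of res_head.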

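Explanation

I'll try to explain by breaking the data.table code into steps.

  • Check the difference between consecutive values of y within each x group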

       setDT(df)[,               # converts `df` from `data.frame` to `data.table`
         indx :=                 # create an index column
         c(0, diff(y)), by = x]  # difference between consecutive `y` elements in each `x` group
       # `diff` returns one element fewer than the length of each `x` group, so `0` is
       # prepended. It can be any value other than `1`, so that the next step can use it
       # to build a grouping index.
    
  • Create a grouping index, indx1, from the indx created in the previous step

     df[, indx1 := cumsum(indx != 1), by = x] # you can check the result of this step to understand the process
    
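  • Along with indx1, we use x as a grouping variable and take the first and last values of y in each group

     df1 <- df[, list(start.rng = y[1],  # first `y` value
                      end.rng = y[.N],   # last `y` value; `.N` is the number of rows in each group
                      days.rng = .N),    # group length, i.e. `.N`
               by = list(x, indx1)]      # grouped by `x` and `indx1`

  • If you don't want the column indx1, drop it by assigning NULL, the same way indx is dropped with indx := NULL at the end of the chained call above, e.g. df1[, indx1 := NULL]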
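
The two index-building steps can also be followed without data.table. This is only an illustrative sketch that swaps in base R's ave() for the by = x grouping; the toy data frame is made up, and the column names indx and indx1 mirror the ones used in the steps above.

```r
# Made-up data: group "a" has a 2-day run plus a single day,
# group "b" has two single days
toy <- data.frame(
  x = c("a", "a", "a", "b", "b"),
  y = as.Date(c("2014-01-01", "2014-01-02", "2014-03-12",
                "2014-06-01", "2014-06-03"))
)

# Step 1: per-group gap in days to the previous row, 0 for the first row
toy$indx <- ave(as.numeric(toy$y), toy$x,
                FUN = function(v) c(0, diff(v)))

# Step 2: every gap other than 1 opens a new range within the group
toy$indx1 <- ave(toy$indx, toy$x,
                 FUN = function(v) cumsum(v != 1))

toy$indx
# 0 1 69 0 2
toy$indx1
# 1 1 2 1 2
```

Rows that share both x and indx1 belong to the same range, which is why the last step groups by list(x, indx1).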