Separate episodes for non-consecutive dates

Time: 2014-05-01 05:46:28

Tags: r date

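I'd appreciate any help. There should be a fairly simple solution, and my attempts so far have not been very elegant. Basically, this is what I have, although the actual data set has 35K rows: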

> x <- c("A","A","A","A","A","B","B","B","B", "B", "C")
> y <- c("01/01/2000", "01/02/2000", "01/03/2000", "01/10/2000", "01/11/2000","01/05/2000", "01/07/2000", "01/08/2000", "01/17/2000", "01/18/2000", "01/01/2000")
> y <- as.Date(y, "%m/%d/%Y")
> df <- data.frame(x, y)
> df
   x          y
1  A 2000-01-01
2  A 2000-01-02
3  A 2000-01-03
4  A 2000-01-10
5  A 2000-01-11
6  B 2000-01-05
7  B 2000-01-07
8  B 2000-01-08
9  B 2000-01-17
10 B 2000-01-18
11 C 2000-01-01

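For each x, I want consecutive dates to share the same number, with the number increasing by one for each subsequent run of dates. Basically, this is what I want: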

> df2
   x z          y
1  A 1 2000-01-01
2  A 1 2000-01-02
3  A 1 2000-01-03
4  A 2 2000-01-10
5  A 2 2000-01-11
6  B 1 2000-01-05
7  B 2 2000-01-07
8  B 2 2000-01-08
9  B 3 2000-01-17
10 B 3 2000-01-18
11 C 1 2000-01-01

Or this output would work:

xz min        max
A1 2000-01-01 2000-01-03
A2 2000-01-10 2000-01-11
B1 2000-01-05 2000-01-05
B2 2000-01-07 2000-01-08
B3 2000-01-17 2000-01-18
C1 2000-01-01 2000-01-01  

Thanks!

1 answer:

Answer 0: (score: 3)

Here is an approach using rle, diff, and data.table:

library(data.table)
# make df a data.table
setDT(df)

define_grp <- function(x) {
  # run length encoding on difference
  xx <- rle(as.numeric(diff(x)))
  # replace values with logical vector for not 1
  xx$values <- xx$values!=1
  # an appropriate cumulative sum (starting with TRUE)
  cumsum(c(TRUE,inverse.rle(xx)))
}

df[,z := define_grp(y),by=x]
df
#     x          y z
#  1: A 2000-01-01 1
#  2: A 2000-01-02 1
#  3: A 2000-01-03 1
#  4: A 2000-01-10 2
#  5: A 2000-01-11 2
#  6: B 2000-01-05 1
#  7: B 2000-01-07 2
#  8: B 2000-01-08 2
#  9: B 2000-01-17 3
# 10: B 2000-01-18 3
# 11: C 2000-01-01 1
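
The answer above produces the first requested output. For the alternative min/max output from the question, the following is a minimal sketch in base R rather than data.table. It assumes rows are already sorted by date within each x. The names `episode_id` and `episodes` are made up for illustration. The helper `cumsum(c(TRUE, diff(d) != 1))` should give the same numbering as `define_grp`, since applying `!= 1` to the run values and then inverting the run-length encoding is the same as applying `!= 1` to the differences directly.

```r
# Rebuild the example data from the question (base R only).
df <- data.frame(
  x = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C"),
  y = as.Date(c("01/01/2000", "01/02/2000", "01/03/2000", "01/10/2000",
                "01/11/2000", "01/05/2000", "01/07/2000", "01/08/2000",
                "01/17/2000", "01/18/2000", "01/01/2000"), "%m/%d/%Y")
)

# Hypothetical helper: a new episode starts at the first row of a group
# and wherever the gap to the previous date is not exactly one day.
# Assumes d is sorted in increasing order.
episode_id <- function(d) cumsum(c(TRUE, diff(d) != 1))

# Number the episodes within each x (dates passed as day counts).
df$z <- ave(as.numeric(df$y), df$x, FUN = episode_id)

# One row per (x, z) episode with its first and last date.
episodes <- do.call(rbind, lapply(
  split(df, list(df$x, df$z), drop = TRUE),
  function(g) data.frame(x = g$x[1], z = g$z[1], min = min(g$y), max = max(g$y))
))

# Order by group and episode number, and drop the split-generated row names.
episodes <- episodes[order(episodes$x, episodes$z), ]
rownames(episodes) <- NULL
episodes
```

If the real 35K-row data is not sorted, sorting first with something like `df <- df[order(df$x, df$y), ]` would be needed before numbering the episodes.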