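I'd appreciate any help. There should be a fairly simple solution; my attempts so far haven't been very elegant. Basically, this is what I have, although the actual data set has 35K rows: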
> x <- c("A","A","A","A","A","B","B","B","B", "B", "C")
> y <- c("01/01/2000", "01/02/2000", "01/03/2000", "01/10/2000", "01/11/2000","01/05/2000", "01/07/2000", "01/08/2000", "01/17/2000", "01/18/2000", "01/01/2000")
> y <- as.Date(y, "%m/%d/%Y")
> df <- data.frame(x, y)
> df
   x          y
1  A 2000-01-01
2  A 2000-01-02
3  A 2000-01-03
4  A 2000-01-10
5  A 2000-01-11
6  B 2000-01-05
7  B 2000-01-07
8  B 2000-01-08
9  B 2000-01-17
10 B 2000-01-18
11 C 2000-01-01
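Within each x, I want consecutive dates to share the same number, and the number to go up by one each time a new run of consecutive dates starts. Basically, this is what I want: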
> df2
   x z          y
1  A 1 2000-01-01
2  A 1 2000-01-02
3  A 1 2000-01-03
4  A 2 2000-01-10
5  A 2 2000-01-11
6  B 1 2000-01-05
7  B 2 2000-01-07
8  B 2 2000-01-08
9  B 3 2000-01-17
10 B 3 2000-01-18
11 C 1 2000-01-01
Alternatively, this output would work:
xz        min        max
A1 2000-01-01 2000-01-03
A2 2000-01-10 2000-01-11
B1 2000-01-05 2000-01-05
B2 2000-01-07 2000-01-08
B3 2000-01-17 2000-01-18
C1 2000-01-01 2000-01-01
Thanks!
Answer 0 (score: 3)
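Here is an approach using rle, diff and data.table. It assumes the rows are already sorted by date within each x, as they are in your example: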
library(data.table)
# make df a data.table
setDT(df)
define_grp <- function(x) {
  # run length encoding on difference
  xx <- rle(as.numeric(diff(x)))
  # replace values with logical vector for not 1
  xx$values <- xx$values != 1
  # an appropriate cumulative sum (starting with TRUE)
  cumsum(c(TRUE, inverse.rle(xx)))
}
df[,z := define_grp(y),by=x]
df
#     x          y z
#  1: A 2000-01-01 1
#  2: A 2000-01-02 1
#  3: A 2000-01-03 1
#  4: A 2000-01-10 2
#  5: A 2000-01-11 2
#  6: B 2000-01-05 1
#  7: B 2000-01-07 2
#  8: B 2000-01-08 2
#  9: B 2000-01-17 3
# 10: B 2000-01-18 3
# 11: C 2000-01-01 1
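If you would rather not depend on data.table, or you want the second (min/max) output from the question, something like the following base R sketch might do. It is not the code above, just one possible variant: it swaps the rle round trip for a plain cumsum over "gap is not 1 day" flags, and the names `key`, `groups` and `runs` are made up for illustration. It also assumes the dates are unique within each x.

```r
x <- c("A","A","A","A","A","B","B","B","B","B","C")
y <- as.Date(c("01/01/2000", "01/02/2000", "01/03/2000", "01/10/2000",
               "01/11/2000", "01/05/2000", "01/07/2000", "01/08/2000",
               "01/17/2000", "01/18/2000", "01/01/2000"), "%m/%d/%Y")
df <- data.frame(x, y)

# Make sure dates are ascending within each x before taking differences
df <- df[order(df$x, df$y), ]

# Within each x, a new run starts at the first row and wherever the gap
# to the previous date is not exactly 1 day; cumsum numbers the runs
df$z <- ave(as.numeric(df$y), df$x,
            FUN = function(d) cumsum(c(1, diff(d) != 1)))

# Second output: one row per run with its first and last date
# (key is a hypothetical helper column, e.g. "A1", "B3")
key <- paste0(df$x, df$z)
groups <- split(df$y, key)
runs <- data.frame(
  xz  = names(groups),
  min = do.call(c, unname(lapply(groups, min))),
  max = do.call(c, unname(lapply(groups, max))),
  row.names = NULL
)
runs
```

One caveat: `split` orders the keys alphabetically, so with ten or more runs in a group "A10" would sort before "A2". If that matters, keep x and z as separate columns and order on those instead.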