我有这样的数据:
set.seed(12345)
df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))
这是期望的输出:
group size int out id id2
1 A 1000 5.585529 0.000000 1 1
2 A 1000 5.709466 0.000000 1 1
3 A 1000 4.890697 0.000000 1 1
4 A 1000 0.000000 0.000000 1 1
5 A 1000 0.000000 0.000000 1 1
6 A 0 0.000000 4.080678 1 1
7 A 0 0.000000 4.883752 NA 1
8 A 0 0.000000 6.817312 NA 1
9 A 1000 4.546503 0.000000 2 2
10 A 1000 5.605887 0.000000 2 2
11 A 1000 3.182044 0.000000 2 2
12 A 1000 0.000000 0.000000 2 2
13 A 1000 0.000000 0.000000 2 2
14 A 1000 0.000000 0.000000 2 2
15 A 1000 0.000000 0.000000 2 2
16 A 0 0.000000 5.370628 2 2
17 A 0 0.000000 5.520216 NA 2
18 A 0 0.000000 4.249468 NA 2
19 A 1000 5.630099 0.000000 3 3
20 A 1000 4.723816 0.000000 3 3
21 A 1000 4.715840 0.000000 3 3
22 A 1000 0.000000 0.000000 3 3
23 A 1000 0.000000 0.000000 3 3
24 A 0 0.000000 5.816900 3 3
25 A 0 0.000000 4.113642 NA 3
26 A 0 0.000000 4.668422 NA 3
根据上述数据创建新组id
。我相信rle
功能是可行的方法,但我无法弄清楚到底。
答案 0 :(得分:3)
@ ycw答案的变体:
library(data.table)
setDT(df)
df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]
group size int out g
1: A 1000 5.585529 0.000000 1
2: A 1000 5.709466 0.000000 1
3: A 1000 4.890697 0.000000 1
4: A 1000 0.000000 0.000000 1
5: A 1000 0.000000 0.000000 1
6: A 0 0.000000 4.080678 1
7: A 0 0.000000 4.883752 NA
8: A 0 0.000000 6.817312 NA
9: A 2000 4.546503 0.000000 3
10: A 2000 5.605887 0.000000 3
11: A 2000 3.182044 0.000000 3
12: A 2000 0.000000 0.000000 3
13: A 2000 0.000000 0.000000 3
14: A 2000 0.000000 0.000000 3
15: A 2000 0.000000 0.000000 3
16: A 0 0.000000 5.370628 3
17: A 0 0.000000 5.520216 NA
18: A 0 0.000000 4.249468 NA
19: A 5000 5.630099 0.000000 5
20: A 5000 4.723816 0.000000 5
21: A 5000 4.715840 0.000000 5
22: A 5000 0.000000 0.000000 5
23: A 5000 0.000000 0.000000 5
24: A 0 0.000000 5.816900 5
25: A 0 0.000000 4.113642 NA
26: A 0 0.000000 4.668422 NA
group size int out g
(@ ycw建议我单独回答。另外,NA^x
技巧是从@akrun借用的。)
对于OP的组号,这个额外的步骤有效:
df[, g := match(g, unique(na.omit(g)))]
对于扩展,OP添加(“id2”):
w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]
所以最后我们有......
group size int out g g2
1: A 1000 5.585529 0.000000 1 1
2: A 1000 5.709466 0.000000 1 1
3: A 1000 4.890697 0.000000 1 1
4: A 1000 0.000000 0.000000 1 1
5: A 1000 0.000000 0.000000 1 1
6: A 0 0.000000 4.080678 1 1
7: A 0 0.000000 4.883752 NA 1
8: A 0 0.000000 6.817312 NA 1
9: A 2000 4.546503 0.000000 2 2
10: A 2000 5.605887 0.000000 2 2
11: A 2000 3.182044 0.000000 2 2
12: A 2000 0.000000 0.000000 2 2
13: A 2000 0.000000 0.000000 2 2
14: A 2000 0.000000 0.000000 2 2
15: A 2000 0.000000 0.000000 2 2
16: A 0 0.000000 5.370628 2 2
17: A 0 0.000000 5.520216 NA 2
18: A 0 0.000000 4.249468 NA 2
19: A 5000 5.630099 0.000000 3 3
20: A 5000 4.723816 0.000000 3 3
21: A 5000 4.715840 0.000000 3 3
22: A 5000 0.000000 0.000000 3 3
23: A 5000 0.000000 0.000000 3 3
24: A 0 0.000000 5.816900 3 3
25: A 0 0.000000 4.113642 NA 3
26: A 0 0.000000 4.668422 NA 3
group size int out g g2
对于基本R类似物,有一个关于如何使rleid
没有data.table的SO Q&amp; A;可以手动构造shift
(它只是一个滞后运算符);还有其他方法可以找到w
(也许tapply
?)。
答案 1 :(得分:2)
以下是使用dplyr
和rleid
包中的data.table
函数的选项。 dt2
是最终输出。
library(dplyr)
library(data.table)
df2 <- df %>%
mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
mutate(runID = rleid(non_zero)) %>%
mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
group_by(runID) %>%
mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
ungroup() %>%
select(group, size, int, out, id, id2 = runID)