How to create groups based on a condition

Time: 2017-08-21 18:04:54

Tags: r

I have data like this:

set.seed(12345)

df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
             int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
             out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))         

This is the desired output:

   group size      int      out  id  id2
1      A 1000 5.585529 0.000000   1    1
2      A 1000 5.709466 0.000000   1    1
3      A 1000 4.890697 0.000000   1    1
4      A 1000 0.000000 0.000000   1    1
5      A 1000 0.000000 0.000000   1    1
6      A    0 0.000000 4.080678   1    1
7      A    0 0.000000 4.883752   NA   1
8      A    0 0.000000 6.817312   NA   1
9      A 1000 4.546503 0.000000   2    2
10     A 1000 5.605887 0.000000   2    2
11     A 1000 3.182044 0.000000   2    2
12     A 1000 0.000000 0.000000   2    2
13     A 1000 0.000000 0.000000   2    2
14     A 1000 0.000000 0.000000   2    2
15     A 1000 0.000000 0.000000   2    2
16     A    0 0.000000 5.370628   2    2
17     A    0 0.000000 5.520216   NA   2
18     A    0 0.000000 4.249468   NA   2
19     A 1000 5.630099 0.000000   3    3 
20     A 1000 4.723816 0.000000   3    3
21     A 1000 4.715840 0.000000   3    3
22     A 1000 0.000000 0.000000   3    3
23     A 1000 0.000000 0.000000   3    3
24     A    0 0.000000 5.816900   3    3
25     A    0 0.000000 4.113642   NA   3
26     A    0 0.000000 4.668422   NA   3

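I want to create the new group column id from the data above. I believe the rle function is the way to go, but I can't quite work out how.

2 answers:

Answer 0: (score: 3)

A variation on @ycw's answer: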

library(data.table)
setDT(df)

df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]

    group size      int      out  g
 1:     A 1000 5.585529 0.000000  1
 2:     A 1000 5.709466 0.000000  1
 3:     A 1000 4.890697 0.000000  1
 4:     A 1000 0.000000 0.000000  1
 5:     A 1000 0.000000 0.000000  1
 6:     A    0 0.000000 4.080678  1
 7:     A    0 0.000000 4.883752 NA
 8:     A    0 0.000000 6.817312 NA
 9:     A 2000 4.546503 0.000000  3
10:     A 2000 5.605887 0.000000  3
11:     A 2000 3.182044 0.000000  3
12:     A 2000 0.000000 0.000000  3
13:     A 2000 0.000000 0.000000  3
14:     A 2000 0.000000 0.000000  3
15:     A 2000 0.000000 0.000000  3
16:     A    0 0.000000 5.370628  3
17:     A    0 0.000000 5.520216 NA
18:     A    0 0.000000 4.249468 NA
19:     A 5000 5.630099 0.000000  5
20:     A 5000 4.723816 0.000000  5
21:     A 5000 4.715840 0.000000  5
22:     A 5000 0.000000 0.000000  5
23:     A 5000 0.000000 0.000000  5
24:     A    0 0.000000 5.816900  5
25:     A    0 0.000000 4.113642 NA
26:     A    0 0.000000 4.668422 NA
    group size      int      out  g

(@ycw suggested I post this as a separate answer. Also, the NA^x trick is borrowed from @akrun.)
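
In case the one-liner is hard to read, here is a rough base R sketch of its two pieces on a made-up toy vector (the values below are illustrative, not taken from the question): the lagged "is zero" test, and the NA^x mask, which relies on NA^0 being 1 and NA^1 being NA.

```r
# Toy stand-in for the `out` column (values are made up for illustration).
out <- c(0, 0, 4.1, 4.9, 6.8, 0, 5.4, 5.5)

# Base R lag: previous element, with NA in the first position
# (the same shape data.table::shift() returns by default).
lag_zero <- c(NA, head(out == 0, -1))

# TRUE while `out` is zero, and also on the first non-zero row after a zero run.
# Row 1 is TRUE | NA, which evaluates to TRUE.
z <- out == 0 | lag_zero

# NA^0 is 1 and NA^1 is NA, so this maps TRUE -> 1 and FALSE -> NA.
mask <- NA^(!z)

print(z)
print(mask)
```

Multiplying the run ids by this mask keeps the id where z is TRUE and turns it into NA elsewhere, which is why the raw result is numbered 1, NA, 3, NA, 5, NA before renumbering.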

To get the OP's group numbers, this extra step works:

df[, g := match(g, unique(na.omit(g)))]

For the extension the OP added ("id2"):

w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]

So in the end we have...

    group size      int      out  g g2
 1:     A 1000 5.585529 0.000000  1  1
 2:     A 1000 5.709466 0.000000  1  1
 3:     A 1000 4.890697 0.000000  1  1
 4:     A 1000 0.000000 0.000000  1  1
 5:     A 1000 0.000000 0.000000  1  1
 6:     A    0 0.000000 4.080678  1  1
 7:     A    0 0.000000 4.883752 NA  1
 8:     A    0 0.000000 6.817312 NA  1
 9:     A 2000 4.546503 0.000000  2  2
10:     A 2000 5.605887 0.000000  2  2
11:     A 2000 3.182044 0.000000  2  2
12:     A 2000 0.000000 0.000000  2  2
13:     A 2000 0.000000 0.000000  2  2
14:     A 2000 0.000000 0.000000  2  2
15:     A 2000 0.000000 0.000000  2  2
16:     A    0 0.000000 5.370628  2  2
17:     A    0 0.000000 5.520216 NA  2
18:     A    0 0.000000 4.249468 NA  2
19:     A 5000 5.630099 0.000000  3  3
20:     A 5000 4.723816 0.000000  3  3
21:     A 5000 4.715840 0.000000  3  3
22:     A 5000 0.000000 0.000000  3  3
23:     A 5000 0.000000 0.000000  3  3
24:     A    0 0.000000 5.816900  3  3
25:     A    0 0.000000 4.113642 NA  3
26:     A    0 0.000000 4.668422 NA  3
    group size      int      out  g g2

For a base R analogue: there is an SO Q&A on how to write rleid without data.table; shift can be built by hand (it is just a lag operator); and there are other ways to find w (maybe tapply?).
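
A minimal sketch of what such a base R version might look like, assuming the df from the question. The helpers rleid_base and lag1 are hypothetical names introduced here, not functions from any package, and tapply is just one possible way to find w.

```r
# Rebuild the question's data so the block runs on its own.
set.seed(12345)
df <- data.frame(group = rep("A", 26),
                 size = c(rep(1000,5), rep(0,3), rep(1000,7), rep(0,3), rep(1000,5), rep(0,3)),
                 int  = c(rnorm(3,5,1), rep(0,5), rnorm(3,5,1), rep(0,7), rnorm(3,5,1), rep(0,5)),
                 out  = c(rep(0,5), rnorm(3,5,1), rep(0,7), rnorm(3,5,1), rep(0,5), rnorm(3,5,1)))

# Hypothetical helper: run-length ids built from base::rle().
rleid_base <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Hypothetical helper: lag by one position, NA in front.
lag1 <- function(x) c(NA, head(x, -1))

z <- df$out == 0 | lag1(df$out == 0)
g <- rleid_base(z) * NA^(!z)        # 1, NA, 3, NA, 5, NA per run
g <- match(g, unique(na.omit(g)))   # renumber the non-NA ids to 1, 2, 3

# First row of each non-NA group, found with tapply() instead of a join.
w  <- tapply(seq_along(g), g, min)
g2 <- cumsum(seq_along(g) %in% w)

df$g  <- g
df$g2 <- g2
print(df)
```

This leans on the same assumption as the data.table version: the data starts with a run of zeros in out, so the first row opens group 1.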

Answer 1: (score: 2)

Here is an option using dplyr together with the rleid function from the data.table package. df2 is the final output.

library(dplyr)
library(data.table)

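df2 <- df %>%
  # Flag rows where size is non-zero
  mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
  # Number each run of identical flags: 1, 2, 3, ...
  mutate(runID = rleid(non_zero)) %>%
  # Pair each non-zero run with the zero run after it: (1,2) -> 1, (3,4) -> 2, ...
  mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
  group_by(runID) %>%
  # Set id to NA on the last two rows of each group
  mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
  ungroup() %>%
  select(group, size, int, out, id, id2 = runID)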
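
For comparison, the same size-based idea can probably be written without any packages. The following is a sketch, not part of the original answer, and it assumes the data begins with a non-zero run of size and that every block has more than two rows. The names non_zero, starts and from_end are introduced here for illustration.

```r
# The size column from the question.
size <- c(rep(1000,5), rep(0,3), rep(1000,7), rep(0,3), rep(1000,5), rep(0,3))

non_zero <- size != 0

# A new block starts wherever a non-zero run begins:
# row 1, or any switch from zero to non-zero.
starts <- non_zero & !c(FALSE, head(non_zero, -1))
id2 <- cumsum(starts)

# Position counted from the end of each block; the last two rows get NA.
from_end <- ave(seq_along(id2), id2, FUN = function(i) rev(seq_along(i)))
id <- ifelse(from_end <= 2, NA, id2)

print(data.frame(size, id, id2))
```

Here cumsum over the block starts replaces the rleid plus odd/even arithmetic, and ave plays the role of group_by with row_number.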