Feel free to edit this title to make it more understandable/generalizable...
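I have a data.table object in which 3 columns form groups (id, id2, pol_loc). Within these groups are row observations, and one row in each group holds either a star (*) or NA. I'd like to efficiently build an indicator column that marks, within each group, where every row sits relative to the star (1 before the star, 0 after it). This is what the data table looks like: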
id id2 pol_loc non_pol cluster_tag
1: 1 1 3 do NA
2: 1 1 3 you NA
3: 1 1 3 * NA
4: 1 1 3 it NA
-------------------------------------
5: 1 2 3 but 4
6: 1 2 3 i NA
7: 1 2 3 * NA
8: 1 2 3 really 2
9: 1 2 3 bad NA
-------------------------------------
10: 1 2 5 but 4
11: 1 2 5 i NA
12: 1 2 5 hate NA
13: 1 2 5 really 2
14: 1 2 5 * NA
15: 1 2 5 dogs NA
-------------------------------------
16: 2 1 4 i NA
17: 2 1 4 am NA
18: 2 1 4 the NA
19: 2 1 4 * NA
20: 2 1 4 friend NA
-------------------------------------
21: 3 1 4 do NA
22: 3 1 4 you NA
23: 3 1 4 really 2
24: 3 1 4 * NA
-------------------------------------
25: 3 2 NA NA NA
id id2 pol_loc non_pol cluster_tag
Desired output:
id id2 pol_loc non_pol cluster_tag before
1: 1 1 3 do NA 1
2: 1 1 3 you NA 1
3: 1 1 3 * NA NA
4: 1 1 3 it NA 0
----------------------------------------------
5: 1 2 3 but 4 1
6: 1 2 3 i NA 1
7: 1 2 3 * NA NA
8: 1 2 3 really 2 0
9: 1 2 3 bad NA 0
----------------------------------------------
10: 1 2 5 but 4 1
11: 1 2 5 i NA 1
12: 1 2 5 hate NA 1
13: 1 2 5 really 2 1
14: 1 2 5 * NA NA
15: 1 2 5 dogs NA 0
----------------------------------------------
16: 2 1 4 i NA 1
17: 2 1 4 am NA 1
18: 2 1 4 the NA 1
19: 2 1 4 * NA NA
20: 2 1 4 friend NA 0
----------------------------------------------
21: 3 1 4 do NA 1
22: 3 1 4 you NA 1
23: 3 1 4 really 2 1
24: 3 1 4 * NA NA
----------------------------------------------
25: 3 2 NA NA NA NA
id id2 pol_loc non_pol cluster_tag before
MWE
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
id2 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), pol_loc = c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 5L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, NA), non_pol = c("do", "you",
"*", "it", "but", "i", "*", "really", "bad", "but", "i",
"hate", "really", "*", "dogs", "i", "am", "the", "*", "friend",
"do", "you", "really", "*", NA), cluster_tag = c(NA, NA,
NA, NA, "4", NA, NA, "2", NA, "4", NA, NA, "2", NA, NA, NA,
NA, NA, NA, NA, NA, NA, "2", NA, NA)), row.names = c(NA,
-25L), class = "data.frame", .Names = c("id", "id2", "pol_loc",
"non_pol", "cluster_tag"))
library(data.table)
setDT(dat)
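EDIT: If it's easier or more efficient, the NA could become 0 or 1 instead; it makes no difference to me. I'm guessing that would be more efficient.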
Answer 0 (score: 5)
Try
dat[, before := 1 - cumsum(non_pol == "*"), by = .(id, id2, pol_loc)][non_pol == "*", before := NA]
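To see why the one-liner works, it may help to split it into steps on a small table. The sketch below is only an illustration: the table `toy` and the columns `grp`, `is_star`, and `seen` are hypothetical names that are not part of the question's data, and it assumes each group contains at most one star.

```r
library(data.table)

# Hypothetical toy table: two groups, one star in each
toy <- data.table(
  grp     = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  non_pol = c("do", "you", "*", "it", "i", "*", "bad")
)

# Step 1: flag the star rows (TRUE at the star, FALSE elsewhere)
toy[, is_star := non_pol == "*"]

# Step 2: running count of stars within each group;
# 0 before the star, 1 from the star row onward
toy[, seen := cumsum(is_star), by = grp]

# Step 3: flip the count so rows before the star get 1
# and rows from the star onward get 0
toy[, before := 1 - seen]

# Step 4: blank out the star row itself
toy[(is_star), before := NA_real_]

print(toy)
```

If the star row is allowed to be 0 rather than NA, as the EDIT in the question permits, step 4 can simply be dropped. Note that with more than one star in a group, `1 - cumsum(...)` would go negative after the second star, so the single-star assumption matters.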