Change a column's values based on a condition and by group

Date: 2019-03-31 06:00:07

Tags: r data.table

My data looks like this:

     year month flag group
 1: 1992     6    1     8
 2: 1992     7    0     8
 3: 1992     8    0     8
 4: 1992     9    0     8
 5: 1992    10    0     8
 6: 1992    11    0     8
 7: 1992    12    0     8
 8: 1995     6    0    10
 9: 1995     7    0    11
10: 1995     8    0    11
11: 1995     9    1    11
12: 1995    10    0    11
13: 1995    11    0    11
14: 1995    12    0    11
15: 1998     6    0    13
16: 1998     7    0    13
17: 1998     8    0    13
18: 1998     9    0    13
19: 1998    10    0    13
20: 1998    11    0    13
21: 1998    12    0    13

What I need to do is set flag to 1 for every row from the first observed 1 onward, and this has to be done by group.

As a concrete example, I want this:

     year month flag group
 1: 1992     6    1     8
 2: 1992     7    1     8
 3: 1992     8    1     8
 4: 1992     9    1     8
 5: 1992    10    1     8
 6: 1992    11    1     8
 7: 1992    12    1     8
 8: 1995     6    0    10
 9: 1995     7    0    11
10: 1995     8    0    11
11: 1995     9    1    11
12: 1995    10    1    11
13: 1995    11    1    11
14: 1995    12    1    11
15: 1998     6    0    13
16: 1998     7    0    13
17: 1998     8    0    13
18: 1998     9    0    13
19: 1998    10    0    13
20: 1998    11    0    13
21: 1998    12    0    13

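Note that rows 1:7 are now all 1, as are rows 11:14, and that rows 15:21 are unchanged because that group had no 1 to begin with.

Most of my ideas revolve around using which to find the index of the first 1 by group, but I have run into some trouble.

If anyone has a data.table-based solution, that would be great.

Thanks for your help!

In case it helps, here is the dput() of my underlying data: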

library(data.table)

DT = setDT(structure(list(year = c(1992, 1992, 1992, 1992, 1992, 1992, 1992, 
1992, 1992, 1992, 1992, 1992, 1995, 1995, 1995, 1995, 1995, 1995, 
1995, 1995, 1995, 1995, 1995, 1995, 1998, 1998, 1998, 1998, 1998, 
1998, 1998, 1998, 1998, 1998, 1998, 1998), month = c(1, 2, 3, 
4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), flag = c(0, 0, 
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), group = c(8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 10L, 10L, 10L, 10L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 11L, 13L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L, 13L, 13L, 13L)), row.names = c(NA, -36L), 
class = c("data.table", "data.frame")))

4 Answers:

Answer 0 (score: 2)

We return 1 for every row at or after the first flag == 1 within its group, provided the group contains at least one flag == 1:

library(data.table)
dt[,flag := +(seq_len(.N)>= which.max(flag == 1) & any(flag == 1)),by = group]

dt

#    year month flag group
# 1: 1992     6    1     8
# 2: 1992     7    1     8
# 3: 1992     8    1     8
# 4: 1992     9    1     8
# 5: 1992    10    1     8
# 6: 1992    11    1     8
# 7: 1992    12    1     8
# 8: 1995     6    0    10
# 9: 1995     7    0    11
#10: 1995     8    0    11
#11: 1995     9    1    11
#12: 1995    10    1    11
#13: 1995    11    1    11
#14: 1995    12    1    11
#15: 1998     6    0    13
#16: 1998     7    0    13
#17: 1998     8    0    13
#18: 1998     9    0    13
#19: 1998    10    0    13
#20: 1998    11    0    13
#21: 1998    12    0    13
#    year month flag group

In dplyr this would be:

library(dplyr)
dt %>%
   group_by(group) %>%
   mutate(flag = +(row_number() >= which.max(flag == 1) & any(flag == 1)))

And in base R, using ave, it would be:

dt$flag <- with(dt, +(ave(flag == 1, group, FUN = function(x) 
                     seq_along(x) >= which.max(x) & any(x))))

Data

dt <- structure(list(year = c(1992, 1992, 1992, 1992, 1992, 1992, 1992, 
1992, 1992, 1992, 1992, 1992, 1995, 1995, 1995, 1995, 1995, 1995, 
1995, 1995, 1995, 1995, 1995, 1995, 1998, 1998, 1998, 1998, 1998, 
1998, 1998, 1998, 1998, 1998, 1998, 1998), month = c(1, 2, 3, 
4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), flag = c(0, 0, 
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), group = c(8L, 8L, 8L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 10L, 10L, 10L, 10L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 11L, 13L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L, 13L, 13L, 13L)), row.names = c(NA, -36L), class = 
c("data.table","data.frame"))
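
A minimal base-R sketch of why the any(flag == 1) guard in the data.table line above matters. which.max() returns 1 when every element is FALSE, so without the guard a group with no 1 at all would be filled entirely with 1. The vector no_ones is a made-up illustration, not part of the question's data.

```r
# Hypothetical group whose flag column contains no 1 at all.
no_ones <- c(0, 0, 0)

# which.max() on an all-FALSE vector returns the first position.
first_pos <- which.max(no_ones == 1)

# Without the guard, every row is at or after position 1, so all become 1.
without_guard <- +(seq_along(no_ones) >= first_pos)

# With the guard, any() is FALSE and the whole group stays 0.
with_guard <- +(seq_along(no_ones) >= first_pos & any(no_ones == 1))

print(without_guard)
print(with_guard)
```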

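Answer 1 (score: 1)

You can do a non-equi join on the first flagged month in each group (this relies on month increasing within each group):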

DT[unique(DT[flag==1], by="group"), on=.(group, month >= month), flag := 1]

Here is the result on the OP's dput:

    year month flag group
 1: 1992     1    0     8
 2: 1992     2    0     8
 3: 1992     3    0     8
 4: 1992     4    0     8
 5: 1992     5    0     8
 6: 1992     6    1     8
 7: 1992     7    1     8
 8: 1992     8    1     8
 9: 1992     9    1     8
10: 1992    10    1     8
11: 1992    11    1     8
12: 1992    12    1     8
13: 1995     1    0    10
14: 1995     2    0    10
15: 1995     3    0    10
16: 1995     4    0    10
17: 1995     5    0    10
18: 1995     6    0    10
19: 1995     7    0    11
20: 1995     8    0    11
21: 1995     9    1    11
22: 1995    10    1    11
23: 1995    11    1    11
24: 1995    12    1    11
25: 1998     1    0    13
26: 1998     2    0    13
27: 1998     3    0    13
28: 1998     4    0    13
29: 1998     5    0    13
30: 1998     6    0    13
31: 1998     7    0    13
32: 1998     8    0    13
33: 1998     9    0    13
34: 1998    10    0    13
35: 1998    11    0    13
36: 1998    12    0    13
    year month flag group
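
To make the join easier to follow, here is a sketch that splits it into two steps on a small made-up table. The toy values and the name first_hit are illustrative assumptions, not from the question.

```r
library(data.table)

# Toy data (hypothetical): three groups, month increasing within each group.
DT <- data.table(
  group = c(8L, 8L, 8L, 11L, 11L, 11L, 13L, 13L),
  month = c(6, 7, 8, 7, 8, 9, 6, 7),
  flag  = c(1, 0, 0, 0, 1, 0, 0, 0)
)

# Step 1: keep the first flagged row of each group.
# Groups with no 1 (here group 13) do not appear in first_hit.
first_hit <- unique(DT[flag == 1], by = "group")

# Step 2: update join. For each row of first_hit, match the rows of DT in
# the same group whose month is >= the first flagged month; set flag to 1.
DT[first_hit, on = .(group, month >= month), flag := 1]

print(DT)
```

Group 13 is never matched, so its rows keep their original 0.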

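Answer 2 (score: 0)

You can use cumsum with dplyr (df stands for the data shown in the question). The comparison must be >= 1 so that the row holding the first 1 is kept as 1:

library(dplyr)
df %>%
  group_by(group) %>%
  mutate(flag = ifelse(cumsum(flag) >= 1, 1, 0))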

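Another idea is to use lag. Note that lag(flag, 1) looks back only one row of the original column, so this version extends each 1 by a single row rather than to the end of the group:

df %>%
  group_by(group) %>%
  mutate(flag = ifelse(flag != 1 & row_number() > 1, lag(flag, 1), flag))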

Or, in data.table:

df[, flag := ifelse(cumsum(flag) >= 1, 1, 0), by=group]
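
A small base-R sketch comparing the cumulative idea with a one-row look-back on a single group's values. The vector is made up. cummax() is shown as an equivalent alternative for a 0/1 column rather than something the answer itself uses, and prev is a base-R stand-in for dplyr's lag(flag, 1).

```r
# Hypothetical flag values for one group: the first 1 is in position 2.
flag <- c(0, 1, 0, 0)

# Cumulative sum: once a 1 has been seen, the running total stays >= 1.
cumulative <- as.numeric(cumsum(flag) >= 1)

# For a 0/1 column the running maximum gives the same result.
via_cummax <- cummax(flag)

# One-row look-back: prev mimics lag(flag, 1), shifting values down by one.
prev <- c(NA, head(flag, -1))
one_step <- ifelse(flag != 1 & seq_along(flag) > 1, prev, flag)

print(cumulative)
print(via_cummax)
print(one_step)
```

The last vector shows the 1 carried forward by one position only, which is why the cumulative form is the safer choice here.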

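Answer 3 (score: 0)

Using na.locf() from the zoo package

Step 1: pick out the groups that contain at least one "1" and replace the "0"s in them with NA

Step 2: use na.locf() to carry the most recent non-NA value forward over everything below it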

library(zoo)
library(data.table)

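# Fill forward within each group so a trailing 1 cannot leak into the next group,
# then reset the NAs that remain before a group's first 1 back to 0.
temp[group %in% temp[,max(flag),.(group)][V1==1]$group & flag == 0,flag:= NA][,flag:=na.locf(flag,na.rm = FALSE),by = group][is.na(flag),flag:=0]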

Input table (temp)

temp <- structure(list(year = c(1992, 1992, 1992, 1992, 1992, 1992, 1992, 
1995, 1995, 1995, 1995, 1995, 1995, 1995, 1998, 1998, 1998, 1998, 
1998, 1998, 1998), month = c(6, 7, 8, 9, 10, 11, 12, 6, 7, 8, 
9, 10, 11, 12, 6, 7, 8, 9, 10, 11, 12), flag = c(1, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), group = c(8L, 
8L, 8L, 8L, 8L, 8L, 8L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 13L, 
13L, 13L, 13L, 13L, 13L, 13L)), row.names = c(NA, -21L), class = c("data.table", 
"data.frame"))

Output table

structure(list(year = c(1992, 1992, 1992, 1992, 1992, 1992, 1992, 
1995, 1995, 1995, 1995, 1995, 1995, 1995, 1998, 1998, 1998, 1998, 
1998, 1998, 1998), month = c(6, 7, 8, 9, 10, 11, 12, 6, 7, 8, 
9, 10, 11, 12, 6, 7, 8, 9, 10, 11, 12), flag = c(1, 1, 1, 1, 
1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0), group = c(8L, 
8L, 8L, 8L, 8L, 8L, 8L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 13L, 
13L, 13L, 13L, 13L, 13L, 13L)), row.names = c(NA, -21L), class = c("data.table", 
"data.frame"))
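
As a rough illustration of the two steps on a single group's values, here is a sketch on a made-up vector. It assumes the zoo package is installed. The final line handles the NAs that come before the first 1, which na.locf() leaves in place when na.rm = FALSE.

```r
library(zoo)

# Hypothetical flag values for one group after step 1:
# the 0s have already been replaced by NA, the single 1 is kept.
x <- c(NA, NA, 1, NA, NA)

# Step 2: carry the last non-NA value forward; leading NAs stay NA.
filled <- na.locf(x, na.rm = FALSE)

# Rows before the first 1 had nothing to carry forward, so set them to 0.
filled[is.na(filled)] <- 0

print(filled)
```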