Question

I have panel data with an ID, a year, and a variable indicating whether the individual was treated at that point in time:

id  year   treated  
1   2000      0            
1   2001      0            
1   2002      1            
1   2003      1            
1   2004      1

I need to create a dummy variable that flags the year in which treatment first occurred. The desired output looks like this:

id  year   treated   treatment_year
1   2000      0            0
1   2001      0            0
1   2002      1            1
1   2003      1            0
1   2004      1            0

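This seems like it should be simple, but I have been stuck for a while and can't get any of the ordering functions to do it. Any help is much appreciated.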

Answer 1

You can use match to get the index of the first 1 within each id and replace everything else with 0.

This can be done with dplyr:

library(dplyr)
df %>%
  group_by(id) %>%
  mutate(treatment_year = replace(treated, -match(1L, treated), 0L))
  # Can also use (note: this gives NA rather than 0 for an id with no 1):
  # mutate(treatment_year = +(row_number() == match(1L, treated)))

#     id  year treated treatment_year
#  <int> <int>   <int>          <int>
#1     1  2000       0              0
#2     1  2001       0              0
#3     1  2002       1              1
#4     1  2003       1              0
#5     1  2004       1              0

Base R:

df$treatment_year <- with(df, ave(treated, id, FUN = function(x) 
                          replace(x, -match(1L, x), 0L)))

And data.table:

library(data.table)
setDT(df)[, treatment_year := replace(treated, -match(1L, treated), 0L), id]
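
To see the grouping at work with more than one individual, here is a minimal base R sketch. The data frame `panel` and its values are hypothetical and were made up for illustration; they are not part of the original question.

```r
# Hypothetical panel with two individuals; id 2 is first treated in 2001
panel <- data.frame(
  id      = c(1L, 1L, 1L, 2L, 2L, 2L),
  year    = c(2000L, 2001L, 2002L, 2000L, 2001L, 2002L),
  treated = c(0L, 0L, 1L, 0L, 1L, 1L)
)

# ave() applies the function within each id and returns values in row order
panel$treatment_year <- with(panel, ave(treated, id, FUN = function(x)
  replace(x, -match(1L, x), 0L)))

panel$treatment_year
#[1] 0 0 1 0 1 0
```

Each id gets its own first-treatment flag, so the 1 for id 2 lands on 2001 rather than on the overall first 1 in the column.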

An explanation of how this works.

match returns the index of the first match. Consider this example:

x <- c(0, 0, 1, 1, 1)
match(1, x)
#[1] 3

The first 1 is found at position 3. Putting - in front of that index excludes it, so replace sets every other value to 0.

replace(x, -match(1, x), 0)
#[1] 0 0 1 0 0

If x always contains only 1/0 values and always has at least one 1, we can also use which.max instead of match.

which.max(x)
#[1] 3
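
The "at least one 1" condition matters for individuals who are never treated. The following base R sketch illustrates the difference; the vectors `treated_some` and `never_treated` and the helper `first_flag` are hypothetical names introduced here for illustration only.

```r
# Hypothetical vectors for illustration only
treated_some  <- c(0L, 0L, 1L, 1L, 1L)
never_treated <- c(0L, 0L, 0L)

# match() returns NA when there is no 1, so replace() leaves x unchanged
first_flag <- function(x) replace(x, -match(1L, x), 0L)

first_flag(treated_some)
#[1] 0 0 1 0 0
first_flag(never_treated)
#[1] 0 0 0

# which.max() returns 1 for an all-zero vector, which would flag the first row
which.max(never_treated)
#[1] 1
```

So if some ids may never be treated, the match version looks like the safer choice, while which.max would need an extra check such as any(x == 1).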

Answer 2

We can use row_number and which.max to create a logical index and coerce it to binary:

library(dplyr)
df1 %>% 
   group_by(id) %>% 
   mutate(treatment_year = +(row_number() == which.max(treated)))
# A tibble: 5 x 4
# Groups:   id [1]
#     id  year treated treatment_year
#  <int> <int>   <int>          <int>
#1     1  2000       0              0
#2     1  2001       0              0
#3     1  2002       1              1
#4     1  2003       1              0
#5     1  2004       1              0

Or use duplicated to create the logical expression:

df1 %>%
    group_by(id) %>%
    mutate(treatment_year = +(!duplicated(treated) & as.logical(treated)))
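
To make the duplicated expression easier to follow, here is a small base R sketch that evaluates its pieces on a single vector. The vector `x` mirrors the treated column from the question; the step-by-step breakdown is an illustration, not part of the original answer.

```r
x <- c(0L, 0L, 1L, 1L, 1L)

# TRUE at the first occurrence of each distinct value (the first 0 and the first 1)
!duplicated(x)
#[1]  TRUE FALSE  TRUE FALSE FALSE

# Keep only the first occurrences where the value is 1, then coerce to 0/1
first_one <- +(!duplicated(x) & as.logical(x))
first_one
#[1] 0 0 1 0 0
```

Because the result is built from logical conditions rather than an index, a vector with no 1 at all simply yields all zeros.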

Data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L), year = 2000:2004, 
    treated = c(0L, 0L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-5L))
