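How do I fill in values between two factors in R?

Time: 2016-06-23 20:59:40

Tags: r data.table dplyr stata

How can I fill in a 'duration' column between a 'start' and an 'end' indicator, as in the example below?

In Stata, it would be: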

by id (year), sort: gen duration=1 if start==1
by id (year), sort: replace duration=1 if duration[_n-1]==1 & end!=1

How can I do this in R, possibly using dplyr?

id  year    start   end 
1   2000    0       0   
1   2001    1       0   
1   2002    0       0   
1   2003    0       1   
1   2004    0       0   
2   2000    0       0   
2   2001    0       0   
2   2002    1       0   
2   2003    0       0   
2   2004    0       1   

The output would be:

id  year    start   end duration
1   2000    0       0   0
1   2001    1       0   1
1   2002    0       0   1
1   2003    0       1   0
1   2004    0       0   0
2   2000    0       0   0
2   2001    0       0   0
2   2002    1       0   1
2   2003    0       0   1
2   2004    0       1   0

3 answers:

Answer 0 (score: 4)

Using dplyr, this seems to do the trick. First, the sample data:

dd<-read.table(text="id  year    start   end 
1   2000    0       0   
1   2001    1       0   
1   2002    0       0   
1   2003    0       1   
1   2004    0       0   
2   2000    0       0   
2   2001    0       0   
2   2002    1       0   
2   2003    0       0   
2   2004    0       1", header = TRUE)

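Now we group by id only and use cumsum() to track the changes at start and end: start - end is +1 on a start row and -1 on an end row, so its running total is 1 inside the spell and 0 elsewhere. This assumes the rows are already sorted by year within each id.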

library(dplyr)
dd %>% group_by(id) %>% mutate(duration = cumsum(start-end))

#       id  year start   end duration
#    (int) (int) (int) (int)    (int)
# 1      1  2000     0     0        0
# 2      1  2001     1     0        1
# 3      1  2002     0     0        1
# 4      1  2003     0     1        0
# 5      1  2004     0     0        0
# 6      2  2000     0     0        0
# 7      2  2001     0     0        0
# 8      2  2002     1     0        1
# 9      2  2003     0     0        1
# 10     2  2004     0     1        0
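
To see why the running sum works, here is a minimal base R sketch on a single id, using the values for id 1 from the question. The variable name `step` is only for illustration.

```r
# start and end indicators for id 1, years 2000-2004 (copied from the question)
start <- c(0, 1, 0, 0, 0)
end   <- c(0, 0, 0, 1, 0)

# +1 on the start row, -1 on the end row, 0 everywhere else
step <- start - end

# The running total is 1 from the start row up to, but not including, the end row
duration <- cumsum(step)
print(duration)
# 0 1 1 0 0
```

The trick relies on each start being followed by a matching end in year order. If the rows might be unsorted, sorting first (for example with `arrange(id, year)` in dplyr) before the `group_by()` step should keep the result correct.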

Answer 1 (score: 1)

Using logic similar to the code you provided:

#Make data
df <- data.frame("id" = c(1,1,1,1,1,2,2,2,2,2),
                 "year" = c(2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002, 2003, 2004),
                 "start" = c(0,1,0,0,0,0,0,1,0,0),
                 "end" = c(0,0,0,1,0,0,0,0,0,1))

#Order by ID and Year
df <- df[order(df$id,df$year),]

#Make new variable
df$duration <- 0
df$duration[df$start==1 & df$end != 1] <- 1
#Carry the 1 forward one row at a time within each id, stopping at an end row
#(a single vectorised lag() would only fill one row after each start)
for (i in seq_len(nrow(df))[-1]) {
  if (df$id[i] == df$id[i-1] && df$duration[i-1] == 1 && df$end[i] != 1) {
    df$duration[i] <- 1
  }
}
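
The Stata `replace` in the question works through the rows in order, so the flag keeps carrying forward across a spell of any length. Below is a minimal sketch of that row-by-row rule for a single id's rows, assumed to be sorted by year. The helper name `fill_duration` and the six-row example are made up for illustration and are not part of the answer.

```r
# Hypothetical helper: sequential fill for one id's rows, already sorted by year
fill_duration <- function(start, end) {
  # Flag the start rows first
  duration <- as.numeric(start == 1 & end != 1)
  for (i in seq_along(duration)[-1]) {
    # Carry the flag forward unless this row is an end row
    if (duration[i - 1] == 1 && end[i] != 1) duration[i] <- 1
  }
  duration
}

# Made-up spell that starts in row 2 and ends in row 6
start <- c(0, 1, 0, 0, 0, 0)
end   <- c(0, 0, 0, 0, 0, 1)
print(fill_duration(start, end))
# 0 1 1 1 1 0
```

A one-shot comparison against the previous row would fill only row 3 here, which is why the rule has to be applied sequentially.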

Answer 2 (score: 1)

We can use base R (here df1 is a data frame holding the question's data):

df1$duration <- with(df1, ave(start-end, id, FUN = cumsum))
df1
#   id year start end duration
#1   1 2000     0   0        0
#2   1 2001     1   0        1
#3   1 2002     0   0        1
#4   1 2003     0   1        0
#5   1 2004     0   0        0
#6   2 2000     0   0        0
#7   2 2001     0   0        0
#8   2 2002     1   0        1
#9   2 2003     0   0        1
#10  2 2004     0   1        0
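
Since the question is also tagged data.table, the same grouped running sum can probably be written there as well. This is a sketch, assuming the data.table package is installed; the object name `dt` is made up, and the data are rebuilt from the question so the block runs on its own.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  id    = rep(1:2, each = 5),
  year  = rep(2000:2004, times = 2),
  start = c(0, 1, 0, 0, 0, 0, 0, 1, 0, 0),
  end   = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1)
)

# Make sure rows are in year order within each id before accumulating
setorder(dt, id, year)

# Add the column by reference: running sum of (start - end) within each id
dt[, duration := cumsum(start - end), by = id]
print(dt)
```

The `duration` column should match the output shown in the question: 0 1 1 0 0 for id 1 and 0 0 1 1 0 for id 2.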