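I am trying to subset my data so that it keeps rows only up to the first occurrence of a value. I am working with panel data that tracks workers' careers, and I want to subset it so that each person's rows are shown only up to the point where they first became Boss.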
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
So I would like the data to look like this:
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
2 1990 Jane Manager 0
2 1991 Jane Boss 1
This seems like basic censoring, but it is crucial for my analysis. Any help would be appreciated.
Answer 0 (score: 4)
Here is a dplyr solution that uses two useful window functions, lag() and cumall():
df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)
library(dplyr)
# Use mutate to see the values of the new variables
df %>%
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))
# Use filter to see the results
df %>%
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))
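We use lag() to find each person's job in the previous year, then use cumall() to keep every row up to and including the first instance of "Boss". If the data is not already sorted by year, you can use lag(job, order_by = year) so that lag() uses the year values rather than row order to decide which row is the "previous year". Note that cumall() itself still evaluates rows in the order they appear, so sorting by id and year beforehand is the safer option.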
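As a hedged sketch of handling unsorted input: because cumall() works through rows in the order they appear, one option is to sort with arrange() before filtering. The data frame `shuffled` below is a hypothetical scrambled copy of the example data (the job2 column is omitted for brevity); it is not part of the original question.

```r
library(dplyr)

# Hypothetical: the example rows in scrambled order
shuffled <- data.frame(
  id   = c(1, 2, 1, 2, 1, 1, 2, 1, 2),
  year = c(1993, 1992, 1990, 1990, 1994, 1992, 1993, 1991, 1991),
  name = c("Bon", "Jane", "Bon", "Jane", "Bon", "Bon", "Jane", "Bon", "Jane"),
  job  = c("Boss", "Manager", "Manager", "Manager", "Manager",
           "Manager", "Boss", "Manager", "Boss"),
  stringsAsFactors = FALSE
)

result <- shuffled %>%
  arrange(id, year) %>%   # sort first so row order matches time order
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss")) %>%
  ungroup()

print(result)
```

Sorting once up front keeps both lag() and cumall() working on the same, time-ordered sequence of rows.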
Answer 1 (score: 3)
A base R solution:
do.call(
  rbind,
  by(dat, dat$name, function(x) {
    if ("Boss" %in% x$job) x[1:min(which(x$job == "Boss")), ]
  })
)
# id year name job job2
#Bon.1 1 1990 Bon Manager 0
#Bon.2 1 1991 Bon Manager 0
#Bon.3 1 1992 Bon Manager 0
#Bon.4 1 1993 Bon Boss 1
#Jane.6 2 1990 Jane Manager 0
#Jane.7 2 1991 Jane Boss 1
Another base R solution:
dat$keep <- with(dat,
ave(job=="Boss",name,FUN=function(x) if(1 %in% x) cumsum(x) else 2)
)
with(dat, dat[keep==0 | (job=="Boss" & keep==1),] )
# id year name job job2 keep
#1 1 1990 Bon Manager 0 0
#2 1 1991 Bon Manager 0 0
#3 1 1992 Bon Manager 0 0
#4 1 1993 Bon Boss 1 1
#6 2 1990 Jane Manager 0 0
#7 2 1991 Jane Boss 1 1
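To make the filter condition easier to follow, here is a minimal sketch of the running-count idea on plain vectors. The names `person`, `jobs`, `running`, and `keep_row` are hypothetical and introduced only for illustration. Unlike the answer's version, this simplified sketch does not drop people who never become Boss.

```r
# Hypothetical vectors copied from the example data
person <- c(rep("Bon", 5), rep("Jane", 4))
jobs   <- c("Manager", "Manager", "Manager", "Boss", "Manager",
            "Manager", "Boss", "Manager", "Boss")

is_boss <- jobs == "Boss"

# Running count of "Boss" rows within each person
running <- ave(as.integer(is_boss), person, FUN = cumsum)

# Keep rows before the first Boss (count 0) plus the first Boss row itself
keep_row <- running == 0 | (is_boss & running == 1)

print(data.frame(person, jobs, running, keep_row))
```

Rows after the first "Boss" either have a count above 1 or are non-Boss rows with a count of at least 1, so the condition excludes them.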
A data.table solution:
library(data.table)
dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]
# name id year job job2
#1: Bon 1 1990 Manager 0
#2: Bon 1 1991 Manager 0
#3: Bon 1 1992 Manager 0
#4: Bon 1 1993 Boss 1
#5: Jane 2 1990 Manager 0
#6: Jane 2 1991 Boss 1
Answer 2 (score: 2)
The 'sqldf' library can do the job.
library(sqldf)
miny <- sqldf("select id, min(year) as year from df where job='Boss' group by id")
sqldf("select df.* from df join miny on (df.id=miny.id and df.year<=miny.year)")
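The two queries first find each id's earliest Boss year, then keep the rows at or before that year. As a rough sketch for readers without sqldf, the same two-step logic can be written in base R with aggregate() and merge(). The names `first_boss`, `boss_year`, `merged`, and `result` are hypothetical and chosen for illustration.

```r
df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)

# Step 1: earliest Boss year per id (mirrors the first query)
first_boss <- aggregate(year ~ id, data = df[df$job == "Boss", ], FUN = min)
names(first_boss)[2] <- "boss_year"

# Step 2: join back and keep rows at or before that year (mirrors the second query)
merged <- merge(df, first_boss, by = "id")
result <- merged[merged$year <= merged$boss_year, names(df)]
result <- result[order(result$id, result$year), ]

print(result)
```

As with the inner join in the SQL version, ids that never reach Boss have no match in `first_boss` and are dropped.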
Answer 3 (score: 1)
If your data is stored in a data frame named df:
library(plyr)
ddply(.data = df, .variables = c("name"), .fun = function(x) {
  i <- which(x$job == "Boss")[1]
  if (!is.na(i)) x[1:i, ] # omit lifelong managers
})
# id year name job job2
# 1 1 1990 Bon Manager 0
# 2 1 1991 Bon Manager 0
# 3 1 1992 Bon Manager 0
# 4 1 1993 Bon Boss 1
# 5 2 1990 Jane Manager 0
# 6 2 1991 Jane Boss 1