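Subsetting data up to the first occurrence in R

Time: 2014-02-07 05:14:13

Tags: r

I am trying to subset my data so that it keeps only the rows up to the first occurrence of a value. I am working with panel data that tracks workers' careers, and I want each person's rows to be shown only up to (and including) the year they first become Boss.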

id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1

So I would like the data to look like this:

id  year    name    job   job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1

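This looks like basic censoring, but it is crucial for my analysis. Any help would be appreciated.

4 Answers:

Answer 0: (score: 4)

Here is a dplyr solution that uses two useful window functions, lag() and cumall():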

df <- read.table(header = TRUE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
", stringsAsFactors = FALSE)

library(dplyr)

# Use mutate to see the values of the new variables
df %>% 
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), cumall(last_job != "Boss"))

# Use filter to see the results
df %>% 
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))

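We use lag() to find each person's job in the previous year, then use cumall() to keep every row up to and including the first instance of "Boss". If the data is not already sorted by year, you can use lag(job, order_by = year) so that lag() uses the year values, rather than row order, to decide which row is the "previous year".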
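To make the mechanics concrete, here is a minimal sketch on a single made-up career vector (the vector and the names prev, keep, and kept are illustrative assumptions, not part of the answer above). It shows why the comparison is made against the lagged job: the row where "Boss" first appears still has a non-Boss predecessor, so it is kept, and everything after it is dropped.

```r
library(dplyr)

# Hypothetical career for one person, already sorted by year
job <- c("Manager", "Manager", "Boss", "Manager", "Boss")

# Job held in the previous row; the first row gets "" as its predecessor
prev <- dplyr::lag(job, default = "")
# prev is: "" "Manager" "Manager" "Boss" "Manager"

# cumall() stays TRUE only while every value so far has been TRUE
keep <- dplyr::cumall(prev != "Boss")
# keep is: TRUE TRUE TRUE FALSE FALSE

kept <- job[keep]
kept
# "Manager" "Manager" "Boss"
```

Inside group_by(id), the same two calls run separately for each person, which is what the filter() call above relies on.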

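Answer 1: (score: 3)

A base R solution:

# dat is assumed to be the data frame from the question
do.call(
  rbind,
  by(dat, dat$name, function(x) {
    # keep rows up to the first "Boss"; people who never become Boss are dropped
    if ("Boss" %in% x$job) x[1:min(which(x$job == "Boss")), ]
  })
)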

#       id year name     job job2
#Bon.1   1 1990  Bon Manager    0
#Bon.2   1 1991  Bon Manager    0
#Bon.3   1 1992  Bon Manager    0
#Bon.4   1 1993  Bon    Boss    1
#Jane.6  2 1990 Jane Manager    0
#Jane.7  2 1991 Jane    Boss    1

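Another base R solution:

# keep counts the "Boss" rows seen so far per person (2 if the person never becomes Boss)
dat$keep <- with(dat,
  ave(job == "Boss", name, FUN = function(x) if (1 %in% x) cumsum(x) else 2)
)
with(dat, dat[keep == 0 | (job == "Boss" & keep == 1), ])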

#  id year name     job job2 keep
#1  1 1990  Bon Manager    0    0
#2  1 1991  Bon Manager    0    0
#3  1 1992  Bon Manager    0    0
#4  1 1993  Bon    Boss    1    1
#6  2 1990 Jane Manager    0    0
#7  2 1991 Jane    Boss    1    1
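The ave() idea can also be written without a helper column. The following is only a sketch under stated assumptions: dat is rebuilt here from the question's data, and the names is_boss and keep_row are made up for illustration. It keeps a row when no earlier row for the same person was "Boss". Unlike the versions above, this variant also keeps people who never become Boss, so it is a different design choice, not a drop-in replacement.

```r
# Rebuild the question's data so the sketch is self-contained
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
id  year    name    job    job2
1   1990    Bon     Manager 0
1   1991    Bon     Manager 0
1   1992    Bon     Manager 0
1   1993    Bon     Boss    1
1   1994    Bon     Manager 0
2   1990    Jane    Manager 0
2   1991    Jane    Boss    1
2   1992    Jane    Manager 0
2   1993    Jane    Boss    1
")

is_boss <- dat$job == "Boss"

# For each person, count the Boss rows strictly before the current row
# and keep the row only while that count is still zero.
keep_row <- ave(is_boss, dat$id, FUN = function(x) {
  cumsum(c(0, head(x, -1))) == 0
})

result <- dat[keep_row, ]
result
```

This assumes the rows are already sorted by year within each id; if they are not, sort first, for example with dat <- dat[order(dat$id, dat$year), ].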

A data.table solution:

library(data.table)
dat <- as.data.table(dat)
dat[, if ("Boss" %in% job) .SD[1:min(which(job == "Boss"))], by = name]

#   name id year     job job2
#1:  Bon  1 1990 Manager    0
#2:  Bon  1 1991 Manager    0
#3:  Bon  1 1992 Manager    0
#4:  Bon  1 1993    Boss    1
#5: Jane  2 1990 Manager    0
#6: Jane  2 1991    Boss    1

Answer 2: (score: 2)

The sqldf package can do the job.

library(sqldf)
miny <- sqldf("select id, min(year) as year from df where job='Boss' group by id")
sqldf("select df.* from df join miny on (df.id=miny.id and df.year<=miny.year)")

Answer 3: (score: 1)

If your data is stored in a data frame called df:

library(plyr)
ddply(.data=df, .variables=c("name"), .fun=function(x) {
  i <- which(x$job == "Boss")[1]
  if (!is.na(i)) x[1:i, ] # omit lifelong managers 
})
#   id year name     job job2
# 1  1 1990  Bon Manager    0
# 2  1 1991  Bon Manager    0
# 3  1 1992  Bon Manager    0
# 4  1 1993  Bon    Boss    1
# 5  2 1990 Jane Manager    0
# 6  2 1991 Jane    Boss    1