I have a very simple dataset:
ID Value Time
1 censored 1
1 censored 2
1 uncensored 3
1 uncensored 4
1 censored 5
1 censored 6
2 censored 1
2 uncensored 2
2 uncensored 3
2 uncensored 4
2 censored 5
For each ID, I want to keep the first uncensored occurrence, and then the first censored occurrence that comes after it. For example:
ID Value Time
1 uncensored 3
1 censored 5
2 uncensored 2
2 censored 5
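Not every ID has its first censoring date at time 5; this is just an example.
Value is a binary variable: 1 means censored and 0 means uncensored, but I have labeled the values as shown above.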
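For anyone who wants to run the answers, the example can be rebuilt roughly as follows. This is only a sketch: the name df1 is chosen to match several answers below (others call the same data d or df), and storing Value as character labels rather than 0/1 is an assumption based on the table above.

```r
# Reconstruction of the example data from the question.
# Assumption: Value is stored as text labels, not as 0/1.
df1 <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Value = c("censored", "censored", "uncensored", "uncensored",
            "censored", "censored",
            "censored", "uncensored", "uncensored", "uncensored",
            "censored"),
  Time  = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)
```

Answers that compare Value against 0 and 1 would need the labels converted back to numbers first.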
Answer 0 (score: 4)
You can do this with a standard split-apply-combine strategy:
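do.call(rbind, lapply(split(d, d$ID), function(x) {
  # position of the first "uncensored" row within this ID
  u1 <- which(x$Value == "uncensored")[1]
  # position of the first "censored" row that comes after it
  c1 <- which((x$Value == "censored") & seq_along(x$Value) > u1)[1]
  return(x[c(u1, c1), ])
}))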
Result:
ID Value Time
1.3 1 uncensored 3
1.5 1 censored 5
2.8 2 uncensored 2
2.11 2 censored 5
Answer 1 (score: 4)
Here is another possible data.table solution:
library(data.table)
setDT(df1)[, list(Value = c("uncensored", "censored"),
Time = c(Time[match("uncensored", Value)],
Time[(.N - match("uncensored", rev(Value))) + 2L])),
by = ID]
# ID Value Time
# 1: 1 uncensored 3
# 2: 1 censored 5
# 3: 2 uncensored 2
# 4: 2 censored 5
Or similarly, using which instead of match:
setDT(df1)[, list(Value = c("uncensored", "censored"),
Time = c(Time[which(Value == "uncensored")[1L]],
Time[(.N - which(rev(Value) == "uncensored")[1L]) + 2L])),
by = ID]
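The index arithmetic in the second Time entry can be hard to read, so here is a small base R walk-through for ID 1. It is only an illustration: the variables n, from_end, last_unc and first_cens_after are made up for this sketch, with n standing in for .N.

```r
# Values for ID 1, copied from the question
Value <- c("censored", "censored", "uncensored", "uncensored",
           "censored", "censored")
Time <- 1:6

n <- length(Value)                           # plays the role of .N
from_end <- match("uncensored", rev(Value))  # last "uncensored", counted from the end
last_unc <- n - from_end + 1L                # its position counted from the start
first_cens_after <- last_unc + 1L            # equals (n - from_end) + 2L

Time[first_cens_after]
```

Note that this picks the row right after the last uncensored row, so it appears to rely on each ID having a single run of uncensored rows followed by at least one censored row. If nothing follows, the index points past the end of the vector and the result is NA.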
Answer 2 (score: 2)
Try
library(data.table)
indx <- setDT(df1)[, gr:= rleid(Value), ID
][, c(.I[Value=='uncensored'][1L], .I[Value=='censored' & gr>1][1L]) , ID]$V1
df1[indx][,gr:=NULL]
# ID Value Time
#1: 1 uncensored 3
#2: 1 censored 5
#3: 2 uncensored 2
#4: 2 censored 5
Or use an idea similar to the one in @Thomas's post:
indx <- setDT(df1)[, {
i1 <-.I[Value=='uncensored'][1L]
i2=.I[Value=='censored']
list(c(i1,i2[i2>i1][1L])) }, ID]$V1
df1[indx]
# ID Value Time
#1: 1 uncensored 3
#2: 1 censored 5
#3: 2 uncensored 2
#4: 2 censored 5
Or using dplyr:
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which(Value=='uncensored')[1L]:n()) %>%
slice(match(c('uncensored', 'censored'), Value))
# ID Value Time
#1 1 uncensored 3
#2 1 censored 5
#3 2 uncensored 2
#4 2 censored 5
Answer 3 (score: 1)
Since you mentioned that Value is a binary variable, here is another idea using dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
## convert the labels to binary
## 1 for censored, and 0 for uncensored
mutate(Value = ifelse(Value == "censored", 1, 0)) %>%
## filter first 'uncensored' value in each 'ID' group
## or the 'censored' values that have 'uncensored' as a predecessor
filter(Value == 0 & row_number(Value) == 1 | Value == 1 & lag(Value) == 0)
Which gives:
#Source: local data frame [4 x 3]
#Groups: ID
#
# ID Value Time
#1 1 0 3
#2 1 1 5
#3 2 0 2
#4 2 1 5
Answer 4 (score: 0)
Try
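result = c()
for (i in unique(df$ID)) {
  subdf = df[which(df$ID == i), ]
  # first uncensored row (Value coded as 0)
  idx = min(which(subdf$Value == 0))
  result = rbind(result, subdf[idx, ])
  # first censored row (Value coded as 1) after it; idx is added back
  # because the search runs on the vector with the first idx elements dropped
  idx2 = idx + min(which(subdf$Value[-(1:idx)] == 1))
  result = rbind(result, subdf[idx2, ])
}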
This assumes that the required observations always exist.
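Answer 5 (score: 0)
The following can be applied whenever you want to identify the rows at which a column with inertia changes value (it also works for categorical columns with more than two levels, or for numeric columns):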
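df <- read.table("clipboard", header = TRUE)
# a[i] is TRUE when row i has the same value in column 2 as row i - 1;
# the first row has no predecessor and is marked TRUE so that it is dropped
a <- c(TRUE)
for (i in 1:(nrow(df) - 1)) {
  a <- c(a, duplicated(df[i:(i + 1), 2])[2])
}
# keep only the rows where the value changed
df[!a, ]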
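The same comparison of each row with its predecessor can also be written without a loop. This is a sketch rather than part of the original answer: it rebuilds the question's data inline so that it runs without the clipboard, and the name changed is made up for the illustration.

```r
# Example data from the question, rebuilt inline (Value kept as text labels)
df <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Value = c("censored", "censored", "uncensored", "uncensored",
            "censored", "censored",
            "censored", "uncensored", "uncensored", "uncensored",
            "censored"),
  Time  = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5),
  stringsAsFactors = FALSE
)

# TRUE where Value differs from the previous row; the first row is never flagged
changed <- c(FALSE, df$Value[-1] != df$Value[-nrow(df)])
df[changed, ]
```

Like the loop, this ignores ID boundaries. It matches the desired output here only because every ID starts with a censored row, so no change is detected where one ID ends and the next begins.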