I have a dataset that looks like this:
ID Cond Time1 Time2
1 2 Start Stop1
1 3 Start abc
1 1 abc Stop2
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
3 3 Start Stop1
3 3 Start abc
3 2 abc jkl
3 4 baba Stop1
4 2 Start Stop2
4 1 Start asd
4 2 abc Stop2
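I need to filter the data based on a few criteria. Whenever a row has Cond = 2 and Time1 = Start, I need to keep the rows from that row through the first stop (Stop1 or Stop2). Basically, the result should look like this: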
ID Cond Time1 Time2
1 2 Start Stop1
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
4 2 Start Stop2
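Also, the real dataset has more than 140,000 observations, so efficiency is key. I was thinking of using the dplyr package, but I'm not sure how to approach this.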
Answer 0 (score: 2)
Using dplyr:
dframe = read.table(header = TRUE, text = "ID Cond Time1 Time2
1 2 Start Stop1
1 3 Start abc
1 1 abc Stop2
1 2 Start abc
1 2 abc Stop1
2 2 Start abc
2 4 abc jkl
2 3 abc jkl
2 2 abc jkl
2 3 abc Stop2
3 2 Start abc
3 3 abc Stop2
3 2 Start Stop1
3 3 Start Stop1
3 3 Start abc
3 2 abc jkl
3 4 baba Stop1
4 2 Start Stop2
4 1 Start asd
4 2 abc Stop2")
library(dplyr)
# add index
dframe = data.frame(index = 1:nrow(dframe), dframe)
head(dframe)
# get starting points
start_points = dframe %>%
  filter(Cond == 2 & Time1 == 'Start') %>%
  select(index, ID)
# get stopping points
stop_points = dframe %>%
  filter(substr(Time2, 1, 4) == 'Stop') %>%
  select(index, ID)
# get the stopping point associated with each start point
start_stop = start_points %>%
  left_join(stop_points, by = "ID") %>%
  filter(index.x <= index.y) %>%
  group_by(ID, index.x) %>%
  summarise(index.y = min(index.y)) %>%
  ungroup() %>%
  rename(start_index = index.x, stop_index = index.y)
# add rows between
result = start_stop %>%
  left_join(dframe, by = "ID") %>%
  filter(start_index <= index, index <= stop_index) %>%
  select(-c(start_index, stop_index, index))
> result
Source: local data frame [12 x 4]
ID Cond Time1 Time2
(int) (int) (fctr) (fctr)
1 1 2 Start Stop1
2 1 2 Start abc
3 1 2 abc Stop1
4 2 2 Start abc
5 2 4 abc jkl
6 2 3 abc jkl
7 2 2 abc jkl
8 2 3 abc Stop2
9 3 2 Start abc
10 3 3 abc Stop2
11 3 2 Start Stop1
12 4 2 Start Stop2
Answer 1 (score: 2)
Another solution, using data.table:
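library(data.table)
setDT(DF)  # DF is the question's data; converted to a data.table by reference
# s0: running count of qualifying starts, so each start opens a new group
DF[, s0 := cumsum(Cond==2 & Time1 == "Start")]
# s1: number of stops at or after each row, computed on the reversed rows
DF[.N:1, s1 := cumsum(Time2 %like% "Stop")]
# within each s0 group, keep the rows up to and including the first stop
DF[, .SD[ s1 == s1[1L] ], by=s0]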
s0 ID Cond Time1 Time2 s1
1: 1 1 2 Start Stop1 10
2: 2 1 2 Start abc 8
3: 2 1 2 abc Stop1 8
4: 3 2 2 Start abc 7
5: 3 2 4 abc jkl 7
6: 3 2 3 abc jkl 7
7: 3 2 2 abc jkl 7
8: 3 2 3 abc Stop2 7
9: 4 3 2 Start abc 6
10: 4 3 3 abc Stop2 6
11: 5 3 2 Start Stop1 5
12: 6 4 2 Start Stop2 2
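.SD is the subset of the data associated with each by=s0 group. The .N:1 in the second line temporarily reverses the row order to create s1. If you don't want to keep the new columns, you can remove them, e.g. with DF[, s0 := NULL][, s1 := NULL] or DF[, c("s0", "s1") := NULL].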
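To see what the two helper columns hold, here is a minimal base-R sketch on toy vectors. The vectors `is_start` and `is_stop` and the `s0 > 0` guard are illustrative assumptions, not part of the answer above; `rev(cumsum(rev(x)))` plays the role of the `.N:1` trick.

```r
# Toy flags (hypothetical, not the question's data).
is_start <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
is_stop  <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# s0: forward running count of starts; rows sharing a value follow the same start.
s0 <- cumsum(is_start)

# s1: number of stops at or after each row (reverse, accumulate, reverse back).
s1 <- rev(cumsum(rev(is_stop)))

# s1 of the first row of each s0 group, repeated across that group.
first_s1 <- ave(s1, s0, FUN = function(v) v[1L])

# A row is kept while no stop has been passed since its group's start row.
# The s0 > 0 guard (an assumption added here) drops rows before the first start.
keep <- s0 > 0 & s1 == first_s1
which(keep)
```

On this toy input `which(keep)` selects rows 1, 2, 4 and 5. The guard only matters when the data begins with rows that precede any qualifying start; without it those rows fall into the `s0 == 0` group and are kept up to the first stop.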
If the last line is slow, it is worth trying @eddi's approach:
DF[DF[, .I[ s1 == s1[1L] ], by=s0]$V1]
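Answer 2 (score: 1)
You can use Map to conditionally build the sequences of rows to select, with an anonymous function that checks whether the start row has Cond 2. Here is a solution that uses data.table for syntactic sugar and a slight performance gain. Note that Map pairs the i-th Start row with the i-th Stop row, so this relies on starts and stops alternating one-to-one, as they do in the sample data: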
library(data.table)
setDT(df)
df[unlist(Map(function(t1, t2) if(t1 %in% which(Cond == 2)) t1:t2 else NULL,
which(Time1 == "Start"), which(grepl("Stop", Time2))))]
ID Cond Time1 Time2
1: 1 2 Start Stop1
2: 1 2 Start abc
3: 1 2 abc Stop1
4: 2 2 Start abc
5: 2 4 abc jkl
6: 2 3 abc jkl
7: 2 2 abc jkl
8: 2 3 abc Stop2
9: 3 2 Start abc
10: 3 3 abc Stop2
11: 3 2 Start Stop1
12: 4 2 Start Stop2
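If starts and stops do not pair up one-to-one, a possible variation is to look up, for each qualifying start, the first stop at or after it. This is a rough base-R sketch on toy vectors (`time1`, `time2`, `cond` and `first_stop` are hypothetical names), not a benchmarked replacement. Like the Map call above, it ignores ID boundaries.

```r
# Toy columns (hypothetical): three Start rows but only two Stop rows,
# so pairing the i-th start with the i-th stop would not line up.
time1 <- c("Start", "abc", "Start", "Start", "abc")
time2 <- c("abc", "abc", "Stop1", "abc", "Stop2")
cond  <- c(2, 3, 1, 2, 2)

starts <- which(cond == 2 & time1 == "Start")  # qualifying start rows
stops  <- which(grepl("Stop", time2))          # all stop rows

# For each qualifying start, find the first stop at or after it (NA if none).
first_stop <- vapply(starts, function(s) {
  later <- stops[stops >= s]
  if (length(later)) later[1L] else NA_integer_
}, integer(1))

# Expand each start/stop pair into a row range and drop duplicates.
ok   <- !is.na(first_stop)
rows <- unique(unlist(Map(seq, starts[ok], first_stop[ok])))
rows
```

Starts with no later stop are dropped here; whether they should instead run to the end of the data is a choice the question does not settle.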