Filtering between a start and an end point

Time: 2016-06-21 14:27:15

Tags: r dplyr

I have a dataset that looks like this:

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   3       Start   abc
1   1       abc     Stop2
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
3   3       Start   Stop1
3   3       Start   abc
3   2       abc     jkl
3   4       baba    Stop1
4   2       Start   Stop2
4   1       Start   asd
4   2       abc     Stop2

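I need to filter the data based on several criteria. Whenever Cond = 2 and Time1 = Start, I need to keep the rows from that row up to the first stop point (Stop1 or Stop2). Basically, the result should look like this: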

ID  Cond    Time1   Time2
1   2       Start   Stop1
1   2       Start   abc
1   2       abc     Stop1
2   2       Start   abc
2   4       abc     jkl
2   3       abc     jkl
2   2       abc     jkl
2   3       abc     Stop2
3   2       Start   abc
3   3       abc     Stop2
3   2       Start   Stop1
4   2       Start   Stop2

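Also, the real dataset has more than 140,000 observations, so efficiency is key. I was thinking of using the dplyr package, but I am not sure how to approach this.

3 Answers:

Answer 0 (score: 2)

Using dplyr: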

dframe = read.table(header = TRUE, text = "ID  Cond    Time1   Time2
1   2       Start   Stop1
                    1   3       Start   abc
                    1   1       abc     Stop2
                    1   2       Start   abc
                    1   2       abc     Stop1
                    2   2       Start   abc
                    2   4       abc     jkl
                    2   3       abc     jkl
                    2   2       abc     jkl
                    2   3       abc     Stop2
                    3   2       Start   abc
                    3   3       abc     Stop2
                    3   2       Start   Stop1
                    3   3       Start   Stop1
                    3   3       Start   abc
                    3   2       abc     jkl
                    3   4       baba    Stop1
                    4   2       Start   Stop2
                    4   1       Start   asd
                    4   2       abc     Stop2")

library(dplyr)

# add index
dframe = data.frame(index = 1:nrow(dframe), dframe)
head(dframe)

# get starting points
start_points = dframe %>%
  filter(Cond == 2 & Time1 == 'Start') %>%
  select(index, ID)

# get stopping points
stop_points = dframe %>%
  filter(substr(Time2, 1, 4) == 'Stop') %>%
  select(index, ID)

# get the stopping point associated with each start point
start_stop = start_points %>%
  left_join(stop_points, by = "ID") %>%
  filter(index.x <= index.y) %>%
  group_by(ID, index.x) %>%
  summarise(index.y = min(index.y)) %>%
  ungroup() %>%
  rename(start_index = index.x, stop_index = index.y)

# add rows between
result = start_stop %>%
  left_join(dframe, by = "ID") %>%
  filter(start_index <= index, index <= stop_index) %>%
  select(-c(start_index, stop_index, index))

> result
Source: local data frame [12 x 4]

ID  Cond  Time1  Time2
(int) (int) (fctr) (fctr)
1      1     2  Start  Stop1
2      1     2  Start    abc
3      1     2    abc  Stop1
4      2     2  Start    abc
5      2     4    abc    jkl
6      2     3    abc    jkl
7      2     2    abc    jkl
8      2     3    abc  Stop2
9      3     2  Start    abc
10     3     3    abc  Stop2
11     3     2  Start  Stop1
12     4     2  Start  Stop2

Answer 1 (score: 2)

Another solution, using data.table:

library(data.table)
setDT(DF)  # DF is the data frame from the question
DF[,     s0 := cumsum(Cond==2 & Time1 == "Start")]
DF[.N:1, s1 := cumsum(Time2 %like% "Stop")]

DF[, .SD[ s1 == s1[1L] ], by=s0]

    s0 ID Cond Time1 Time2 s1
 1:  1  1    2 Start Stop1 10
 2:  2  1    2 Start   abc  8
 3:  2  1    2   abc Stop1  8
 4:  3  2    2 Start   abc  7
 5:  3  2    4   abc   jkl  7
 6:  3  2    3   abc   jkl  7
 7:  3  2    2   abc   jkl  7
 8:  3  2    3   abc Stop2  7
 9:  4  3    2 Start   abc  6
10:  4  3    3   abc Stop2  6
11:  5  3    2 Start Stop1  5
12:  6  4    2 Start Stop2  2

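.SD is the subset of the data associated with each by=s0 group. The .N:1 in the second line temporarily reverses the data in order to create s1. If you do not want to keep the new columns, you can drop them, for example with DF[, s0 := NULL][, s1 := NULL] or DF[, c("s0", "s1") := NULL].

If the last line is slow, it is worth trying @eddi's approach: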
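
To see why the two counters isolate each start-to-stop run, here is a minimal base R sketch on the first five rows of the question's table (ID 1). It is an illustration rather than the answer's code: the vector names are hypothetical, `rev(cumsum(rev(...)))` stands in for the `.N:1` trick, `ave()` stands in for the grouped `.SD` subset, and the `s0 > 0` guard is an addition that drops any rows preceding the first qualifying start.

```r
# First five rows of the question's table (ID 1); vector names are hypothetical.
cond  <- c(2, 3, 1, 2, 2)
time1 <- c("Start", "Start", "abc", "Start", "abc")
time2 <- c("Stop1", "abc", "Stop2", "abc", "Stop1")

is_start <- cond == 2 & time1 == "Start"
is_stop  <- grepl("Stop", time2)

# s0: running count of qualifying starts, so each such start opens a new group.
s0 <- cumsum(is_start)
# s1: number of stops from this row to the end (a reversed cumulative sum).
s1 <- rev(cumsum(rev(is_stop)))

# Within each s0 group, keep the rows whose s1 equals the group's first s1,
# i.e. the rows up to and including the first stop after the start.
# The s0 > 0 guard (an addition) excludes rows before the first qualifying start.
keep <- s0 > 0 & s1 == ave(s1, s0, FUN = function(v) v[1])
which(keep)
```

Because s1 only decreases on the row after a stop, every row from a start through its first stop shares the same s1 value, which is what the equality test relies on.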

DF[DF[, .I[ s1 == s1[1L] ], by=s0]$V1]

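Answer 2 (score: 1)

You can use Map to conditionally construct the ranges of rows to select, with an anonymous function that checks whether the start row has Cond equal to 2. Here is a solution that uses data.table for syntactic sugar, with slightly improved performance: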

library(data.table)
setDT(df)
df[unlist(Map(function(t1, t2) if(t1 %in% which(Cond == 2)) t1:t2 else NULL, 
              which(Time1 == "Start"), which(grepl("Stop", Time2))))]
    ID Cond Time1 Time2
 1:  1    2 Start Stop1
 2:  1    2 Start   abc
 3:  1    2   abc Stop1
 4:  2    2 Start   abc
 5:  2    4   abc   jkl
 6:  2    3   abc   jkl
 7:  2    2   abc   jkl
 8:  2    3   abc Stop2
 9:  3    2 Start   abc
10:  3    3   abc Stop2
11:  3    2 Start Stop1
12:  4    2 Start Stop2
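
The pairing that Map performs can be sketched in base R on the first five rows of the question's table (ID 1). This is an illustrative sketch with hypothetical vector names, not the answer's code. Map walks both index vectors in parallel, so the i-th "Start" row is paired with the i-th "Stop" row; the approach therefore assumes that starts and stops line up one-to-one, which holds for the sample data.

```r
# First five rows of the question's table (ID 1); vector names are hypothetical.
cond  <- c(2, 3, 1, 2, 2)
time1 <- c("Start", "Start", "abc", "Start", "abc")
time2 <- c("Stop1", "abc", "Stop2", "abc", "Stop1")

starts <- which(time1 == "Start")      # rows 1, 2, 4
stops  <- which(grepl("Stop", time2))  # rows 1, 3, 5

# Pair the i-th start with the i-th stop; emit the row range only when
# the start row has Cond == 2, otherwise contribute nothing (NULL).
ranges <- Map(function(t1, t2) if (cond[t1] == 2) t1:t2 else NULL,
              starts, stops)

# unlist() drops the NULL entries and flattens the ranges into row indices.
rows <- unlist(ranges)
rows
```

If the data could contain a start without a matching stop (or the reverse), the two vectors would fall out of step, so it may be worth checking that `length(starts) == length(stops)` before relying on this pairing.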