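I'd like to go through 2 million rows in R, find the first instance at which each specified event occurs, and record the time at which it occurs. Notes: (1) the start event time should occur before end event x; (2) the end event x / end event z on one output row should occur before the start event on the next row, and so on.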
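The closest other example I found is: R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2 : keep if _n == 1").
My question is different because I need to 1) look at multiple conditions and only include rows that meet the criteria (threshold, status, etc.), and 2) format the result differently (i.e. pull out the Timestamp values).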
Answer 0 (score: 1)
Might not be the most elegant solution, but it seems to get the job done.
library(tidyverse)
d <- read_csv(
"ID, Timestamp, Enable, Status, Deviation, Threshold
a, 6/10/2015 10:10, 0, 0, 0.5, 0.65
a, 6/10/2015 10:15, 0, 0, 0.6, 0.65
a, 6/10/2015 10:20, 0, 0, 0.75, 0.65
a, 6/10/2015 10:25, 1, 0, 0.8, 0.65
a, 6/10/2015 10:30, 1, 0, 0.9, 0.65
a, 6/10/2015 10:35, 1, 0, 0.8, 0.65
a, 6/10/2015 10:40, 1, 1, 0.7, 0.65
a, 6/10/2015 10:45, 1, 1, 0.5, 0.65
a, 6/10/2015 10:50, 0, 0, 0.6, 0.65
a, 6/10/2015 10:55, 0, 0, 0.7, 0.65
a, 6/10/2015 11:00, 1, 0, 0.8, 0.65
a, 6/10/2015 11:05, 1, 0, 0.9, 0.65
a, 6/10/2015 11:10, 1, 1, 1, 0.65
a, 6/10/2015 11:15, 1, 1, 0.8, 0.65
a, 6/10/2015 11:20, 1, 1, 0.7, 0.65
b, 7/10/2015 11:20, 0, 0, 0.4, 0.5
b, 7/11/2015 11:25, 0, 0, 0.6, 0.5
b, 7/12/2015 11:30, 1, 0, 0.7, 0.5
b, 7/13/2015 11:35, 1, 1, 0.8, 0.5")
d %>%
  mutate(
    start = ifelse(Enable == 0 & Deviation > Threshold & Status == 0, 1, 0),
    end_x = ifelse(Enable == 1 & Deviation > Threshold, 1, 0),
    end_z = ifelse(Enable == 1 & Deviation > Threshold & Status == 1, 1, 0)) %>%
  gather(var, val, start:end_z) %>% # gather them into a single variable
  filter(val == 1) %>% # remove dummy coding
  select(ID, Timestamp, var) %>% # remove unnecessary variables
  group_by(ID, var) %>%
  mutate(count = 1:n()) %>% # create count variable so rows are uniquely identified
  spread(var, Timestamp) %>% # spread it back out
  select(ID, start, end_x, end_z) %>%
  na.omit()
ID start end_x end_z
<chr> <chr> <chr> <chr>
1 a 6/10/2015 10:20 6/10/2015 10:25 6/10/2015 10:40
2 a 6/10/2015 10:55 6/10/2015 10:30 6/10/2015 11:10
3 b 7/11/2015 11:25 7/12/2015 11:30 7/13/2015 11:35
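Note that row 2 of this output has end_x (10:30) earlier than start (10:55). The count column numbers the start rows and the end rows independently, so the n-th start is paired with the n-th row matching the end_x condition, not with the first such row after that start. A minimal sketch of one way around this while staying in dplyr, assuming the rows are already sorted by time within each ID; the sample data is rebuilt here so the sketch runs on its own, and the helper column g and the name res are mine, not part of the answer above:

```r
library(dplyr)

# Same sample data as above, without padding spaces so base read.csv parses it cleanly
d <- read.csv(stringsAsFactors = FALSE, text = "ID,Timestamp,Enable,Status,Deviation,Threshold
a,6/10/2015 10:10,0,0,0.5,0.65
a,6/10/2015 10:15,0,0,0.6,0.65
a,6/10/2015 10:20,0,0,0.75,0.65
a,6/10/2015 10:25,1,0,0.8,0.65
a,6/10/2015 10:30,1,0,0.9,0.65
a,6/10/2015 10:35,1,0,0.8,0.65
a,6/10/2015 10:40,1,1,0.7,0.65
a,6/10/2015 10:45,1,1,0.5,0.65
a,6/10/2015 10:50,0,0,0.6,0.65
a,6/10/2015 10:55,0,0,0.7,0.65
a,6/10/2015 11:00,1,0,0.8,0.65
a,6/10/2015 11:05,1,0,0.9,0.65
a,6/10/2015 11:10,1,1,1,0.65
a,6/10/2015 11:15,1,1,0.8,0.65
a,6/10/2015 11:20,1,1,0.7,0.65
b,7/10/2015 11:20,0,0,0.4,0.5
b,7/11/2015 11:25,0,0,0.6,0.5
b,7/12/2015 11:30,1,0,0.7,0.5
b,7/13/2015 11:35,1,1,0.8,0.5")

res <- d %>%
  group_by(ID) %>%
  # g goes up by one at every start row, so each start opens a new window
  mutate(g = cumsum(Enable == 0 & Deviation > Threshold & Status == 0)) %>%
  filter(g > 0) %>% # drop rows that come before the first start
  group_by(ID, g) %>%
  summarise(
    start = Timestamp[1], # the start row is the first row of its window
    end_x = Timestamp[Enable == 1 & Deviation > Threshold][1],
    end_z = Timestamp[Enable == 1 & Deviation > Threshold & Status == 1][1],
    .groups = "drop" # needs dplyr >= 1.0; use ungroup() on older versions
  )
res
```

On the sample data this pairs the second start for ID a (10:55) with end_x at 11:00. A window with no matching end row gets NA, since indexing an empty vector with [1] returns NA.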
Answer 1 (score: 1)
For each 'ID', use cumsum to create a grouping variable 'g' based on the 'start' condition. Then, for each 'ID' and 'g', select the relevant rows.
library(data.table)
setDT(d)
d[ , g := cumsum(Enable == 0 & Deviation > Threshold & Status == 0), by = ID]
d[g > 0, .(start = Timestamp[1],
           end_x = Timestamp[Enable == 1 & Deviation > Threshold][1],
           end_z = Timestamp[Enable == 1 & Deviation > Threshold & Status == 1][1]),
  by = .(ID, g)]
# ID g start end_x end_z
# 1: a 1 6/10/2015 10:20 6/10/2015 10:25 6/10/2015 10:40
# 2: a 2 6/10/2015 10:55 6/10/2015 11:00 6/10/2015 11:10
# 3: b 1 7/11/2015 11:25 7/12/2015 11:30 7/13/2015 11:35
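To see why cumsum works as a grouping key here, a tiny sketch with a made-up logical vector (is_start is a hypothetical stand-in for the start condition, not a column in the data). Every TRUE bumps the counter, so all rows up to the next start share one value, and rows before the first start stay at 0, which is what g > 0 filters out:

```r
# Hypothetical flags: TRUE marks a row that meets the start condition
is_start <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Summing a logical vector counts the TRUEs seen so far
g <- cumsum(is_start)
g
# [1] 0 0 1 1 1 2 2
```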
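Answer 2 (score: 1)
A solution using dplyr, tidyr, and data.table. case_when is handy for assigning the conditions. After that, remove the rows with NA in Flag, assign a run-length ID in Flag2, keep the first row of each Flag2 run, number the start events in x with cumsum, and finally spread the data frame.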
library(dplyr)
library(tidyr)
library(data.table)
# Assumes `dat` holds the question's data (called `d` in the other answers)
dat2 <- dat %>%
  mutate(Flag = case_when(
    Enable == 0 & Deviation > Threshold & Status == 0 ~ "Start Event Time",
    Enable == 1 & Deviation > Threshold & Status == 0 ~ "End Event x Time",
    Enable == 1 & Deviation > Threshold & Status == 1 ~ "End Event z Time",
    TRUE ~ NA_character_
  )) %>%
  drop_na(Flag) %>%
  mutate(Flag2 = rleid(Flag)) %>% # run-length ID over consecutive equal flags
  group_by(Flag2) %>%
  slice(1) %>% # first row of each run
  ungroup() %>%
  mutate(x = cumsum(Flag == "Start Event Time")) %>% # one x per start event
  group_by(x) %>%
  filter(!(duplicated(Flag) & (Flag == "End Event x Time" | Flag == "End Event z Time"))) %>%
  ungroup() %>%
  # keep only the ID columns plus key and value, so spread() yields one row per x
  select(ID, x, Flag, Timestamp) %>%
  spread(Flag, Timestamp) %>%
  select(ID, `Start Event Time`, `End Event x Time`, `End Event z Time`)
dat2
# # A tibble: 3 x 4
# ID `Start Event Time` `End Event x Time` `End Event z Time`
# * <chr> <chr> <chr> <chr>
# 1 a 6/10/2015 10:20 6/10/2015 10:25 6/10/2015 10:40
# 2 a 6/10/2015 10:55 6/10/2015 11:00 6/10/2015 11:10
# 3 b 7/11/2015 11:25 7/12/2015 11:30 7/13/2015 11:35
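rleid() from data.table numbers consecutive runs of equal values, which is what lets slice(1) keep only the first row of each run. A small sketch with a made-up flag vector (the short labels and the names flag, run_id and first_of_run are only for illustration):

```r
library(data.table)

# Hypothetical flag sequence after dropping the NA rows
flag <- c("start", "x", "x", "x", "z", "start", "x", "z", "z")

# Each change of value starts a new run ID
run_id <- rleid(flag)
run_id
# [1] 1 2 2 2 3 4 5 6 6

# Keeping the first element of every run mirrors group_by(Flag2) %>% slice(1)
first_of_run <- flag[!duplicated(run_id)]
first_of_run
# [1] "start" "x"     "z"     "start" "x"     "z"
```

Note that rleid() here looks only at the flag values, so a run can continue across an ID boundary if the last flag of one ID equals the first flag of the next; passing the ID as well, e.g. rleid(ID, Flag), would avoid that.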