Track the first observation by several criteria in R

Date: 2018-01-29 21:36:05

Tags: r

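I would like a way to go through 2 million rows in R, find the first instance at which each specified event occurs, and record the time at which it occurs. Note: (1) the start event time should occur before end event x, and (2) end event x / end event z in one row should occur before the start event in the next row, and so on.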

The closest other example I found is: R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2 : keep if _n == 1").

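My question is different because I need to 1) check multiple conditions, so that only rows meeting the conditions (Threshold, Status, etc.) are included, and 2) format the result differently (i.e., pull out the Timestamp).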

3 answers:

Answer 0 (score: 1)

Probably not the most elegant solution, but it seems to get the job done.

library(tidyverse)
d <- read_csv(
"ID,  Timestamp,        Enable,      Status,     Deviation,   Threshold
a,   6/10/2015 10:10,     0,           0,           0.5,     0.65
a,   6/10/2015 10:15,     0,           0,           0.6,     0.65
a,   6/10/2015 10:20,     0,           0,           0.75,    0.65
a,   6/10/2015 10:25,     1,           0,           0.8,     0.65
a,   6/10/2015 10:30,     1,           0,           0.9,     0.65
a,   6/10/2015 10:35,     1,           0,           0.8,     0.65
a,   6/10/2015 10:40,     1,           1,           0.7,     0.65
a,   6/10/2015 10:45,     1,           1,           0.5,     0.65
a,   6/10/2015 10:50,     0,           0,           0.6,     0.65
a,   6/10/2015 10:55,     0,           0,           0.7,     0.65
a,   6/10/2015 11:00,     1,           0,           0.8,     0.65
a,   6/10/2015 11:05,     1,           0,           0.9,     0.65
a,   6/10/2015 11:10,     1,           1,           1,       0.65
a,   6/10/2015 11:15,     1,           1,           0.8,     0.65
a,   6/10/2015 11:20,     1,           1,           0.7,     0.65
b,   7/10/2015 11:20,     0,           0,           0.4,     0.5
b,   7/11/2015 11:25,     0,           0,           0.6,     0.5
b,   7/12/2015 11:30,     1,           0,           0.7,     0.5
b,   7/13/2015 11:35,     1,           1,           0.8,     0.5")

d %>% 
  mutate(
    start = ifelse(Enable == 0 & Deviation > Threshold & Status == 0,
               1, 
               0),
    end_x = ifelse(Enable == 1 & Deviation > Threshold, 
               1, 
               0),
    end_z = ifelse(Enable == 1 & Deviation > Threshold & Status == 1, 
               1, 
               0)) %>%
  gather(var, val, start:end_z) %>% # gather them into a single variable
  filter(val == 1) %>% # remove dummy coding
  select(ID, Timestamp, var) %>% # remove unnecessary variables
  group_by(ID, var) %>% 
  mutate(count = 1:n()) %>% # create count variable so rows are uniquely identified
  spread(var, Timestamp) %>% # spread it back out
  select(ID, start, end_x, end_z) %>% 
  na.omit()

  ID    start           end_x           end_z          
  <chr> <chr>           <chr>           <chr>          
1 a     6/10/2015 10:20 6/10/2015 10:25 6/10/2015 10:40
2 a     6/10/2015 10:55 6/10/2015 10:30 6/10/2015 11:10
3 b     7/11/2015 11:25 7/12/2015 11:30 7/13/2015 11:35
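In the output above, the second row pairs the start at 10:55 with an end_x of 10:30, because rows are matched by occurrence count rather than by position in time. A minimal sketch of one way to tie each end event to the start that precedes it, assuming the same example data: number the start events per ID with cumsum and take the first matching row inside each numbered block. The names `is_start`, `episode`, and `res` are illustrative and not part of the original answer.

```r
library(dplyr)

# Same example data as above, read with base R so the block is self-contained.
d <- read.csv(text = "ID,Timestamp,Enable,Status,Deviation,Threshold
a,6/10/2015 10:10,0,0,0.5,0.65
a,6/10/2015 10:15,0,0,0.6,0.65
a,6/10/2015 10:20,0,0,0.75,0.65
a,6/10/2015 10:25,1,0,0.8,0.65
a,6/10/2015 10:30,1,0,0.9,0.65
a,6/10/2015 10:35,1,0,0.8,0.65
a,6/10/2015 10:40,1,1,0.7,0.65
a,6/10/2015 10:45,1,1,0.5,0.65
a,6/10/2015 10:50,0,0,0.6,0.65
a,6/10/2015 10:55,0,0,0.7,0.65
a,6/10/2015 11:00,1,0,0.8,0.65
a,6/10/2015 11:05,1,0,0.9,0.65
a,6/10/2015 11:10,1,1,1,0.65
a,6/10/2015 11:15,1,1,0.8,0.65
a,6/10/2015 11:20,1,1,0.7,0.65
b,7/10/2015 11:20,0,0,0.4,0.5
b,7/11/2015 11:25,0,0,0.6,0.5
b,7/12/2015 11:30,1,0,0.7,0.5
b,7/13/2015 11:35,1,1,0.8,0.5", stringsAsFactors = FALSE)

res <- d %>%
  group_by(ID) %>%
  mutate(
    is_start = Enable == 0 & Deviation > Threshold & Status == 0,
    episode  = cumsum(is_start)  # 0 for rows before the first start of an ID
  ) %>%
  filter(episode > 0) %>%
  group_by(ID, episode) %>%
  summarise(
    # Each block begins at its start row; [1] yields NA if no row matches.
    start = Timestamp[1],
    end_x = Timestamp[Enable == 1 & Deviation > Threshold][1],
    end_z = Timestamp[Enable == 1 & Deviation > Threshold & Status == 1][1],
    .groups = "drop"
  )

res
```

With this grouping, the second row for ID a should report an end_x of 6/10/2015 11:00, which falls after its start.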

Answer 1 (score: 1)

For each 'ID', use cumsum to create a grouping variable 'g' based on the 'start' condition. Then, for each 'ID' and 'g', select the relevant rows.

library(data.table)
setDT(d)
d[ , g := cumsum(Enable == 0 & Deviation > Threshold & Status == 0), by = ID]
d[g > 0, .(start = Timestamp[1],
           end_x = Timestamp[Enable == 1 & Deviation > Threshold][1],
           end_z = Timestamp[Enable == 1 & Deviation > Threshold & Status == 1][1]),
  by = .(ID, g)]
#       ID g              start              end_x              end_z
# 1:     a 1    6/10/2015 10:20    6/10/2015 10:25    6/10/2015 10:40
# 2:     a 2    6/10/2015 10:55    6/10/2015 11:00    6/10/2015 11:10
# 3:     b 1    7/11/2015 11:25    7/12/2015 11:30    7/13/2015 11:35
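The two ordering requirements from the question can be checked on a result like the one above. The sketch below is only an illustration: it retypes the three result rows by hand, assumes the timestamps are month/day/year (the example data contain 7/13/2015, so the second field cannot be a month), and uses hypothetical names (`parse_ts`, `ok_within`, `ok_between`).

```r
# Result rows copied from the output above.
res <- data.frame(
  ID    = c("a", "a", "b"),
  start = c("6/10/2015 10:20", "6/10/2015 10:55", "7/11/2015 11:25"),
  end_x = c("6/10/2015 10:25", "6/10/2015 11:00", "7/12/2015 11:30"),
  end_z = c("6/10/2015 10:40", "6/10/2015 11:10", "7/13/2015 11:35"),
  stringsAsFactors = FALSE
)

# Parse to seconds since the epoch so that comparisons are purely numeric.
parse_ts <- function(x) {
  as.numeric(as.POSIXct(x, format = "%m/%d/%Y %H:%M", tz = "UTC"))
}
start <- parse_ts(res$start)
end_x <- parse_ts(res$end_x)
end_z <- parse_ts(res$end_z)
n <- nrow(res)

# (1) Each start precedes its own end events.
ok_within <- start < end_x & start < end_z

# (2) Within an ID, the previous row's end events precede the next start.
last_end <- pmax(end_x, end_z)
same_id_as_prev <- c(FALSE, res$ID[-1] == res$ID[-n])
ok_between <- !same_id_as_prev | c(TRUE, last_end[-n] < start[-1])

all(ok_within)
all(ok_between)
```

Comparing the raw character timestamps would not be reliable, since strings such as "6/10/2015 9:55" and "6/10/2015 10:20" sort differently as text than as times.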

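Answer 2 (score: 1)

A solution using dplyr, tidyr, and data.table. case_when makes it convenient to assign the conditions. After that, drop the rows where Flag is NA, assign a run-length ID in Flag2, keep the first row of each Flag2 run, number the events with a cumulative count of start events, and finally spread the data frame.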

library(dplyr)
library(tidyr)
library(data.table)

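# Assumes `dat` holds the example data (the same table as `d` in Answer 0).
dat2 <- dat %>%
  mutate(Flag = case_when(
    Enable == 0 & Deviation > Threshold & Status == 0        ~ "Start Event Time",
    Enable == 1 & Deviation > Threshold & Status == 0        ~ "End Event x Time",
    Enable == 1 & Deviation > Threshold & Status == 1        ~ "End Event z Time",
    TRUE                                                     ~ NA_character_
  )) %>%
  drop_na(Flag) %>%                  # keep only rows that match an event
  mutate(Flag2 = rleid(Flag)) %>%    # run-length ID of consecutive identical flags
  group_by(Flag2) %>%
  slice(1) %>%                       # first row of each run
  ungroup() %>%
  mutate(x = cumsum(Flag == "Start Event Time")) %>%  # counter that increases at each start
  group_by(x) %>%
  filter(!(duplicated(Flag) & (Flag == 'End Event x Time' | Flag == 'End Event z Time'))) %>%
  ungroup() %>%
  select(ID, x, Flag, Timestamp) %>% # drop other columns so spread() gives one row per x
  spread(Flag, Timestamp) %>%
  select(ID, `Start Event Time`, `End Event x Time`, `End Event z Time`)
dat2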
# # A tibble: 3 x 4
#   ID    `Start Event Time` `End Event x Time` `End Event z Time`
# * <chr> <chr>              <chr>              <chr>             
# 1 a     6/10/2015 10:20    6/10/2015 10:25    6/10/2015 10:40   
# 2 a     6/10/2015 10:55    6/10/2015 11:00    6/10/2015 11:10   
# 3 b     7/11/2015 11:25    7/12/2015 11:30    7/13/2015 11:35
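The run-length ID step is the part of this answer that keeps only the first row of each stretch of identical flags. As a small illustration of the idea, here is a base R equivalent of the numbering that data.table's rleid produces, applied to a made-up vector of flags (the vector and the names `run_id` and `first_of_run` are hypothetical):

```r
# A short, made-up sequence of event flags.
flags <- c("Start", "End x", "End x", "End x", "End z", "Start", "End x")

# rle() finds runs of identical consecutive values; repeating the run index
# by each run's length gives the same numbering as data.table::rleid(flags).
runs <- rle(flags)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 2 2 2 3 4 5

# Keeping the first element of each run collapses repeated flags.
first_of_run <- !duplicated(run_id)
flags[first_of_run]
# "Start" "End x" "End z" "Start" "End x"
```

Because the numbering depends only on consecutive values, the data must already be sorted by ID and time for the first row of a run to be the earliest occurrence.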