使用过去的迭代向量化移动窗口

时间:2019-05-07 19:51:02

标签: r loops vectorization

给出了一个非常大的数据集(> 100万个观测值),并试图对我的逻辑进行矢量化处理,但是还没有找到解决问题的R化方法。

问题在于,每当变量中出现“不良”观察值时,都需要检查前面的5个观察值是否为“良好”指标。只要前面有5个“良好”观察值,就保留“不良”观察值。如果在5个观察移动窗口内有“不良”观察,那么该观察最终将被从分析中删除。

到目前为止,我已经尝试使用带有ifelse()逻辑的for循环。逻辑检查出来,但是使用R进行处理需要花费数小时才能完成。我已经研究了zoo包中的滚动窗口,但是没有应用诸如mean()sum()之类的聚合函数。我还研究了apply()lapply()等,但无法使其正常工作。

这是我的代码,用于for循环。假设df$Observation是“好vs不好”的初始名称,而df$Result是我们是否保留观察值的确定。

修改

set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))

for(i in 1:nrow(df)){
  ifelse(
    df$Observation[i] == "Good",
    df$Result[i] <- "Keep",
    ifelse(
      df$Observation[i] == "Bad" &
        df$Observation[i-1] == "Good" &
        df$Observation[i-2] == "Good" &
        df$Observation[i-3] == "Good" &
        df$Observation[i-4] == "Good",
      df$Result[i] <- "Keep",
      df$Result[i] <- "Drop"
    )
  )
}

所需结果示例:

df[385:393,]

    Observation Result
385        Good   Keep
386        Good   Keep
387        Good   Keep
388        Good   Keep
389        Good   Keep
390         Bad   Keep
391        Good   Keep
392        Good   Keep
393         Bad   Drop

代码按预期工作,但是我需要一种更有效的方法在R中执行它。谢谢您的帮助!

3 个答案:

答案 0 :(得分:2)

我喜欢zoo。除了第一个坏的实例(之前只有3个Obs)之外,其他一切似乎都匹配。您可以调整逻辑以使用fill = 4

library(tidyverse)
library(zoo)

df_decision <-
  df %>% 
  mutate(
    good_ind = as.integer(Observation == "Good"),
    good_count = rollsum(good_ind, 5, align = "right", fill = good_ind),
    result =ifelse(good_ind == 1 | good_count >= 4, "keep", "drop")
  )

答案 1 :(得分:1)

您可以执行以下操作:

首先,我设置种子,创建一些示例数据并打开必要的包。

set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))
library(zoo)
library(dplyr)

起初我落后一排。从那里,我为滞后的行和前四行计算rollmax。然后,我将此rollmax1进行比较。如果计算结果为TRUE并且当前行等于"Bad",则Result将是"Drop",否则它将是"KEEP"

df2 <- df %>% 
  mutate(Result = if_else(rollmax(lag(Observation) == "Bad", 5, fill = 0, align = "right") == 1 & Observation == "Bad", "Drop", "Keep")) 

这样,它将与您的预期输出匹配:

 df2[385:393,]
    Observation Result
385        Good   Keep
386        Good   Keep
387        Good   Keep
388        Good   Keep
389        Good   Keep
390         Bad   Keep
391        Good   Keep
392        Good   Keep
393         Bad   Drop

答案 2 :(得分:1)

如果用某些dplyr函数替换循环,事情真的会加速。请注意前5行的处理。 dplyr版本将在前5行中删除所有“ Bad”观测值,而您的循环将保留它们。如果需要,可以向case_when添加更多逻辑。

library(tictoc)
library(dplyr)

set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 10000, TRUE, c(0.9,0.1)))
df2 <- df

tic("loop")
for(i in 1:nrow(df)){
  ifelse(
    df$Observation[i] == "Good",
    df$Result[i] <- "Keep",
    ifelse(
      df$Observation[i] == "Bad" &
        df$Observation[i-1] == "Good" &
        df$Observation[i-2] == "Good" &
        df$Observation[i-3] == "Good" &
        df$Observation[i-4] == "Good",
      df$Result[i] <- "Keep",
      df$Result[i] <- "Drop"
    )
  )
}
toc() # 3.9s

tic("dplyr")
df2 <- df2 %>% 
  dplyr::mutate(
    L1 = dplyr::lag(Observation, 1),
    L2 = dplyr::lag(Observation, 2),
    L3 = dplyr::lag(Observation, 3),
    L4 = dplyr::lag(Observation, 4),
    L5 = dplyr::lag(Observation, 5),
    Result = dplyr::case_when(
      Observation == "Good" ~ "Keep",
      L1 == "Good" & 
        L2 == "Good" & 
        L3 == "Good" & 
        L4 == "Good" & 
        L5 == "Good" ~ "Keep",
      TRUE ~ "Drop"
    )
  ) %>% 
  dplyr::select(Observation, Result)
toc() # 0.08s