给出了一个非常大的数据集(> 100万个观测值),并试图对我的逻辑进行矢量化处理,但是还没有找到解决问题的R化方法。
问题在于,每当变量中出现“不良”观察值时,都需要检查前面的5个观察值是否为“良好”指标。只要前面有5个“良好”观察值,就保留“不良”观察值。如果在5个观察移动窗口内有“不良”观察,那么该观察最终将被从分析中删除。
到目前为止,我已经尝试使用带有ifelse()
逻辑的for循环。逻辑检查出来,但是使用R进行处理需要花费数小时才能完成。我已经研究了zoo
包中的滚动窗口,但是没有应用诸如mean()
或sum()
之类的聚合函数。我还研究了apply()
,lapply()
等,但无法使其正常工作。
这是我的代码,用于for循环。假设df$Observation
是“好vs不好”的初始名称,而df$Result
是我们是否保留观察值的确定。
修改
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))
for(i in 1:nrow(df)){
ifelse(
df$Observation[i] == "Good",
df$Result[i] <- "Keep",
ifelse(
df$Observation[i] == "Bad" &
df$Observation[i-1] == "Good" &
df$Observation[i-2] == "Good" &
df$Observation[i-3] == "Good" &
df$Observation[i-4] == "Good",
df$Result[i] <- "Keep",
df$Result[i] <- "Drop"
)
)
}
所需结果示例:
df[385:393,]
Observation Result
385 Good Keep
386 Good Keep
387 Good Keep
388 Good Keep
389 Good Keep
390 Bad Keep
391 Good Keep
392 Good Keep
393 Bad Drop
代码按预期工作,但是我需要一种更有效的方法在R中执行它。谢谢您的帮助!
答案 0 :(得分:2)
我喜欢zoo
。除了第一个坏的实例(之前只有3个Obs)之外,其他一切似乎都匹配。您可以调整逻辑以使用fill = 4
library(tidyverse)
library(zoo)
df_decision <-
df %>%
mutate(
good_ind = as.integer(Observation == "Good"),
good_count = rollsum(good_ind, 5, align = "right", fill = good_ind),
result =ifelse(good_ind == 1 | good_count >= 4, "keep", "drop")
)
答案 1 :(得分:1)
您可以执行以下操作:
首先,我设置种子,创建一些示例数据并打开必要的包。
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 1000, T, c(0.9,0.1)))
library(zoo)
library(dplyr)
起初我落后一排。从那里,我为滞后的行和前四行计算rollmax
。然后,我将此rollmax
与1
进行比较。如果计算结果为TRUE
并且当前行等于"Bad"
,则Result
将是"Drop"
,否则它将是"KEEP"
。
df2 <- df %>%
mutate(Result = if_else(rollmax(lag(Observation) == "Bad", 5, fill = 0, align = "right") == 1 & Observation == "Bad", "Drop", "Keep"))
这样,它将与您的预期输出匹配:
df2[385:393,]
Observation Result
385 Good Keep
386 Good Keep
387 Good Keep
388 Good Keep
389 Good Keep
390 Bad Keep
391 Good Keep
392 Good Keep
393 Bad Drop
答案 2 :(得分:1)
如果用某些dplyr
函数替换循环,事情真的会加速。请注意前5行的处理。 dplyr
版本将在前5行中删除所有“ Bad”观测值,而您的循环将保留它们。如果需要,可以向case_when
添加更多逻辑。
library(tictoc)
library(dplyr)
set.seed(1)
df <- data.frame(Observation = sample(c("Good", "Bad"), 10000, TRUE, c(0.9,0.1)))
df2 <- df
tic("loop")
for(i in 1:nrow(df)){
ifelse(
df$Observation[i] == "Good",
df$Result[i] <- "Keep",
ifelse(
df$Observation[i] == "Bad" &
df$Observation[i-1] == "Good" &
df$Observation[i-2] == "Good" &
df$Observation[i-3] == "Good" &
df$Observation[i-4] == "Good",
df$Result[i] <- "Keep",
df$Result[i] <- "Drop"
)
)
}
toc() # 3.9s
tic("dplyr")
df2 <- df2 %>%
dplyr::mutate(
L1 = dplyr::lag(Observation, 1),
L2 = dplyr::lag(Observation, 2),
L3 = dplyr::lag(Observation, 3),
L4 = dplyr::lag(Observation, 4),
L5 = dplyr::lag(Observation, 5),
Result = dplyr::case_when(
Observation == "Good" ~ "Keep",
L1 == "Good" &
L2 == "Good" &
L3 == "Good" &
L4 == "Good" &
L5 == "Good" ~ "Keep",
TRUE ~ "Drop"
)
) %>%
dplyr::select(Observation, Result)
toc() # 0.08s