Question

I am trying to avoid a for loop and use apply instead to post-process the flags I have detected.

I have a time series with a column that shows whether the quality check passed. Here is what the data frame looks like:

n <- 100
tstart <- strptime("12/15/16 16:00:00", "%m/%d/%y %H:%M:%S")
df <- data.frame(Date = tstart + seq(0,n*5-1,5) + sample(seq(0,3,1), n, replace = T),
             Check = sample(c("FLAG", "PASS"), n, replace = T))

# head of df
#         Date           Check
# 1 2016-12-15 16:00:02  FLAG
# 2 2016-12-15 16:00:05  PASS
# 3 2016-12-15 16:00:13  FLAG
# 4 2016-12-15 16:00:17  PASS
# 5 2016-12-15 16:00:22  FLAG
# 6 2016-12-15 16:00:26  FLAG

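I don't want to pick up every FLAG. I want to apply the following conditions:

1) Ignore flags whose time difference from the previous row is more than 60 seconds.

2) Keep flags that have been repeating for a while.

Here is how I implemented this: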

df$Time_Difference <- c(0,as.numeric(diff(df$Date)))
df$Flag_Counter <- 0
desired_rep <- 3
# Start the clock!
ptm <- proc.time()
for (row_index in 2:nrow(df)){
    if (df[row_index, "Time_Difference"] > 60){
        df[row_index, "Flag_Counter"] <- 0
    }
    else {
        if (df[row_index, "Check"] == "PASS"){
            df[row_index, "Flag_Counter"] <- max(0, df[row_index-1, "Flag_Counter"] - 1)
        }
        else {
            df[row_index, "Flag_Counter"] <- min(desired_rep, df[row_index-1, "Flag_Counter"] + 1)
        }
    }
}
# Stop the clock
x <- proc.time() - ptm
print(x[3])

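So the for loop effectively keeps the flags that are repeated desired_rep times in a row. If we have a FLAG after two PASSes, Flag_Counter is 1, and in the end we can get the post-processed flags with df[df$Flag_Counter == 3, ]. Right now this is very slow. I was wondering whether apply could do this faster. I have done this in Python, but I don't know how to access the previous row inside a predefined function and then use apply. I appreciate your help.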
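Much of the slowness likely comes from reading and writing single data frame cells on every iteration. As a minimal sketch, assuming the update rule is exactly the one in the loop above, the same logic can run over plain vectors. `flag_counter` and `max_gap` are hypothetical names introduced here for illustration; they are not part of the original code.

```r
# Hypothetical helper: same update rule as the loop above, but on plain vectors.
flag_counter <- function(check, time_diff, desired_rep = 3, max_gap = 60) {
  n <- length(check)
  counter <- numeric(n)              # row 1 stays at 0, as in the original loop
  is_flag <- check == "FLAG"
  for (i in seq_len(n)[-1]) {
    if (time_diff[i] > max_gap) {
      counter[i] <- 0                # gap too long: reset the counter
    } else if (is_flag[i]) {
      counter[i] <- min(desired_rep, counter[i - 1] + 1)
    } else {
      counter[i] <- max(0, counter[i - 1] - 1)
    }
  }
  counter
}

# Toy input, made up for illustration
check     <- c("FLAG", "FLAG", "FLAG", "FLAG", "PASS", "FLAG", "FLAG")
time_diff <- c(0, 5, 5, 5, 5, 5, 70)
toy_counter <- flag_counter(check, time_diff)
```

On the real data this would be called as `df$Flag_Counter <- flag_counter(df$Check, df$Time_Difference)`. It is still a loop, so treat it as a baseline rather than a vectorized solution.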

Answer 1

Try this:

desired_rep = 3

# If Time_Difference > 60, 0, otherwise 1 if "Flag", -1 if "Pass"
df$temp = ifelse(df$Check=='FLAG',1,-1)*(df$Time_Difference<=60)

# Do a "cumsum" that's bounded between 0 and 3, and resets to 0 if Time_Difference > 60
df$Flag_Counter = Reduce(function(x,y) max(0, min(desired_rep,x+y))*(y!=0), df$temp, accumulate=TRUE)

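In general, Reduce() is useful when you need to update a "state" sequentially, with the limitation that the input must be a single list or vector (here, the temp column).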
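To see the state update on a small input, here is a sketch with hand-made vectors. One assumption of mine, not something stated in the answer: an explicit starting state of 0 is passed and `Reduce()` is fed the steps from the second element on. That keeps the first counter at 0, which is what the loop in the question does; without an initial value, `Reduce()` starts from the first element of the input itself. `bounded_add` is a hypothetical name.

```r
desired_rep <- 3

# Toy input, made up for illustration
check     <- c("FLAG", "FLAG", "FLAG", "FLAG", "PASS", "FLAG", "FLAG")
time_diff <- c(0, 5, 5, 5, 5, 5, 70)

# +1 for FLAG, -1 for PASS, 0 when the gap exceeds 60 seconds
step <- ifelse(check == "FLAG", 1, -1) * (time_diff <= 60)

# Running sum clamped to [0, desired_rep]; a step of 0 resets the state to 0
bounded_add <- function(state, s) max(0, min(desired_rep, state + s)) * (s != 0)

# init = 0 is the state of row 1; the remaining rows are folded in one by one
counter <- Reduce(bounded_add, step[-1], init = 0, accumulate = TRUE)
counter
# 0 1 2 3 2 3 0
```

The last element drops to 0 because its 70-second gap turns the step into 0, which the multiplication by `(s != 0)` treats as a reset.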

Answer 2

Give this a try:

n <- 100
tstart <- strptime("12/15/16 16:00:00", "%m/%d/%y %H:%M:%S")
df <- data.frame(Date = tstart + seq(0,n*5-1,5) + sample(seq(0,3,1), n, replace = T),
                 Check = sample(c("FLAG", "PASS"), n, replace = T))

desired_rep <- 3 #set the desired repetition limit

The time column you used in your example code was End_Time. I assume this should be Date from the original dataset?

df$Time_Difference <- c(0,as.numeric(diff(df$Date)))

Find the consecutive flags. Credit for this technique goes to another post.

df$consecutive_flag_count <- sequence(rle(as.character(df$Check))$lengths)
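As a quick illustration of what that line computes, here is a sketch on a made-up vector: `rle()` collapses the column into runs of equal values, and `sequence()` turns the run lengths into a position-within-run counter.

```r
# Toy input, made up for illustration
check <- c("FLAG", "FLAG", "PASS", "FLAG", "FLAG", "FLAG", "PASS", "PASS")

runs <- rle(check)
runs$lengths    # 2 1 3 2
runs$values     # "FLAG" "PASS" "FLAG" "PASS"

# Position of each row inside its run
run_position <- sequence(runs$lengths)
run_position    # 1 2 1 1 2 3 1 2
```

Note that the counter also runs over PASS rows, which is why the next step still has to test `Check` separately.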

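Create a check_again column that returns OK if Check is PASS, and CHECK_AGAIN if Check is FLAG, Time_Difference is less than 60, and the row is at least the desired_rep-th consecutive flag. Any other row is OK.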

df$check_again <- ifelse(df$Check == "PASS", "OK", 
 ifelse(df$Time_Difference < 60 & df$consecutive_flag_count >= desired_rep, "CHECK_AGAIN","OK"))
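A small worked example may make the rule easier to check by hand. This is a sketch on a made-up data frame; it folds the nested `ifelse()` into a single condition, which I believe is equivalent for the two `Check` values used here.

```r
desired_rep <- 3

# Toy data, made up for illustration
toy <- data.frame(
  Check            = c("FLAG", "FLAG", "FLAG", "FLAG", "PASS", "FLAG"),
  Time_Difference  = c(0, 5, 5, 70, 5, 5),
  stringsAsFactors = FALSE
)

toy$consecutive_flag_count <- sequence(rle(toy$Check)$lengths)

# CHECK_AGAIN only for flagged rows that are recent and deep enough into a run
toy$check_again <- ifelse(
  toy$Check == "FLAG" &
    toy$Time_Difference < 60 &
    toy$consecutive_flag_count >= desired_rep,
  "CHECK_AGAIN", "OK"
)
toy$check_again
# "OK" "OK" "CHECK_AGAIN" "OK" "OK" "OK"
```

Two behaviours are worth noticing. A long gap (row 4) only excludes that row and does not restart the run count. A single PASS does restart it, whereas the loop in the question only decrements its counter by one on a PASS, so the two approaches can select different rows.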

You can then easily filter down to the CHECK_AGAIN rows, as shown below.

df_check_again <- df[df$check_again == "CHECK_AGAIN", ]
> df_check_again
                  Date Check Time_Difference consecutive_flag_count check_again
3  2016-12-15 16:00:11  FLAG               4                      3 CHECK_AGAIN
4  2016-12-15 16:00:18  FLAG               7                      4 CHECK_AGAIN
17 2016-12-15 16:01:23  FLAG               5                      3 CHECK_AGAIN
18 2016-12-15 16:01:26  FLAG               3                      4 CHECK_AGAIN
19 2016-12-15 16:01:30  FLAG               4                      5 CHECK_AGAIN
20 2016-12-15 16:01:37  FLAG               7                      6 CHECK_AGAIN
27 2016-12-15 16:02:10  FLAG               3                      3 CHECK_AGAIN
28 2016-12-15 16:02:18  FLAG               8                      4 CHECK_AGAIN
29 2016-12-15 16:02:20  FLAG               2                      5 CHECK_AGAIN
42 2016-12-15 16:03:27  FLAG               4                      3 CHECK_AGAIN
43 2016-12-15 16:03:33  FLAG               6                      4 CHECK_AGAIN
44 2016-12-15 16:03:38  FLAG               5                      5 CHECK_AGAIN
55 2016-12-15 16:04:33  FLAG               7                      3 CHECK_AGAIN
56 2016-12-15 16:04:36  FLAG               3                      4 CHECK_AGAIN
57 2016-12-15 16:04:41  FLAG               5                      5 CHECK_AGAIN
58 2016-12-15 16:04:45  FLAG               4                      6 CHECK_AGAIN
85 2016-12-15 16:07:02  FLAG               7                      3 CHECK_AGAIN

Implementing apply with the previous row instead of a for loop
