Alternative to a complex for loop to improve performance

Time: 2016-04-16 20:56:35

Tags: r performance loops for-loop

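The code below takes a 20x1 data frame and checks, for each row, whether any of the 6 rows below it (i.e. rows i + 1 to i + 6) is greater, by more than 2, than the row 3 rows further down (e.g. row (i + 1) - row (i + 4) > 2). If true, a 1 is recorded in a newly created Signal column.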

For example, for the first row it checks whether:

  • row 2 > row 5 + 2, or
  • row 3 > row 6 + 2, or

...

  • row 7 > row 10 + 2
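
As a sanity check of the rule, the six comparisons for the first row can be written out directly. This is a minimal sketch that hard-codes the first ten prices from the data below; the names price, upper, lower and signal_row1 are introduced here for illustration only.

```r
# First ten prices from the question's data (rows 1..10)
price <- c(1221, 1220, 1220, 1217, 1216, 1218, 1216, 1216, 1217, 1220)

i <- 1
upper <- price[(i + 1):(i + 6)]       # rows 2..7
lower <- price[(i + 1):(i + 6) + 3]   # rows 5..10, i.e. 3 rows further down

exceeds <- upper - lower > 2          # TRUE FALSE FALSE FALSE FALSE FALSE
signal_row1 <- as.numeric(any(exceeds))
signal_row1                           # 1
```

Only the first comparison (row 2 against row 5, a difference of 4) exceeds the threshold, which is enough to set the signal for row 1.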

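If possible, I would like to find an alternative to the for loop. I am running this template code on a large database, and the loop can take hours. Note that the loop code is somewhat complicated in order to keep the loop from running out of bounds. Many thanks to @Gregor for a great deal of help putting it together.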

#Data
df <- data.frame(Price = c( 1221, 1220, 1220, 1217, 1216,  1218 , 1216, 1216, 1217, 1220, 1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205))

#Inputs
Window = 6                # check up to this far below current row
IndexDifference = 3       # check row against another this far down
ValueDifference = 2       # for difference at least this big

#Define loop boundaries 
base_rows = 1:(nrow(df) - IndexDifference)  # can't check more than this
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference) # for a given base row, this is the maximum row to start checking against

#Make Signal variable
df$Signal = rep(NA, nrow(df)) #pre-allocate variable
for (i in seq_along(base_rows)) {
  df$Signal[i] = as.numeric(
    any(
      df$Price[(i + 1):candidate_max[i]] - 
        df$Price[((i + 1):candidate_max[i]) + IndexDifference] > ValueDifference))}

3 Answers:

Answer 0 (score: 2)

This is a bit late, but in case it is useful:

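I agree with @alexis_laz that many more comparisons are computed than necessary. I think this idea can be taken a step further, because applying any in a rolling fashion also brings unnecessary computation.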

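The key is that a given row is always compared against one specific other row (the one 3 rows below it, in your example). Once we know whether the comparison holds for that row, every other row that includes it in its window should be given the value 1 (TRUE).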

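The useful shortcut here is that if the comparison holds for row j, which makes row i TRUE, and row j is also within the window of row i+1, then row i+1 is TRUE as well (without needing to know the state of the other points in the window). What I am getting at is that we do not need to evaluate any over the window of every row. If we know how many TRUEs are in the window of row i, then for row i+1 we only need to determine whether the point leaving the window is TRUE and whether the point entering the window is TRUE. Essentially we filter the vector with a box of width Window and then check which entries have at least one TRUE value in their window (this could be done in a single pass, but let's ignore that since the extra time is not significant).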

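Using a rolling sum, we can compute this efficiently by keeping a running count and adding/removing the points that enter/leave the window. This is where @alexis_laz's observation comes in: the state of the points entering/leaving can be computed in advance.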
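
To illustrate the running-count idea in isolation, here is a minimal base-R sketch. It is not one of the benchmarked functions; the name rolling_any and its arguments are hypothetical, and it assumes check is a logical vector of pre-computed comparisons whose windows look forward from each position.

```r
# Running count of TRUEs in a forward-looking window of width `window`.
# Illustrative sketch only; `rolling_any` is a hypothetical helper name.
rolling_any <- function(check, window) {
  n <- length(check)
  # Pad with zeros so that windows near the end can run past the data
  padded <- c(as.integer(check), rep(0L, window))
  out <- integer(n)
  count <- sum(padded[1:window])        # TRUEs in the window starting at 1
  for (i in seq_len(n)) {
    out[i] <- as.integer(count > 0)     # at least one TRUE in window i
    # Slide by one: drop the point leaving, add the point entering
    count <- count - padded[i] + padded[i + window]
  }
  out
}

rolling_any(c(TRUE, FALSE, FALSE, FALSE, TRUE), window = 2)
# 1 0 0 1 1
```

Each step touches only two entries regardless of the window width, which is the property the rolling-sum functions below rely on.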

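To make things more concrete, here is some code. First, let me replicate your original loop, @Richard Telford's answer, and @alexis_laz's answer, wrapping them into functions (slightly rewritten, mostly for personal convenience and so that the output formats match, and hopefully without adding any bugs):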

f_G <- function(x, window, idiff, valdiff){
  base_rows = 1:(NROW(x) - idiff - 1)  # can't check more than this
  candidate_max = pmin(base_rows + window, NROW(x) - idiff) # maximum row to start checking against
  out = rep(0, NROW(x)) #pre-allocate variable
  for (i in seq_along(base_rows)) {
    out[i] = as.numeric(any(x[(i + 1):candidate_max[i]]
           - x[((i + 1):candidate_max[i]) + idiff] > valdiff))}
  return(out)
}

f_RT <- function(x, window, idiff, valdiff){
  x0 <- cbind(x[-(1)][1:NROW(x)], sapply(2:window,
                                        function(i)x[-(1:i)][1:NROW(x)]))
  x1 <- sapply((idiff+1):(idiff+window),
              function(i)x[-(1:i)][1:NROW(x)])
  out <- as.numeric(apply((x0 - x1) > valdiff, 1, any, na.rm = TRUE))
  return(out)
}

f_AL <- function(x, window, idiff, valdiff){
  check = (x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff
  check <- c(check, rep(FALSE, idiff+1))
  out <- as.integer(sapply(seq_along(check),
                      function(i) any(check[i:min(length(check), (i + (window - 1)))])))
  return(out)
}

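Then, here are two functions that compute the rolling sum described above on a vector of pre-computed differences (as @alexis_laz suggested). The first uses the filter function, while the second uses roll_sum from the RcppRoll package.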

f_filt <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## reverse series due to filter using "past" values
  ffilt <- rev(stats::filter(rev(check), rep(1, window), sides=1))  # stats:: avoids masking, e.g. by dplyr::filter
  ## check if at least one
  out <- ifelse(na.omit(ffilt) > 0, 1, 0)
  return(out)
}

library(RcppRoll)
f_roll <- function(x, window, idiff, valdiff){
  ## calculate idiff differences once
  check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
  ## extend series to filter
  check <- c(check, rep(0, window+idiff))
  ## rolling window sum
  froll <- roll_sum(check, n=window, align="right")
  out <- ifelse(froll > 0, 1, 0)
  return(out)
}

As a quick check, we can test that all the functions give the same answer:

f_G(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_RT(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_AL(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_filt(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_roll(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0

Now let's benchmark them. I will also increase the number of rows to test on.

library(microbenchmark)
w <- Window
idiff <- IndexDifference
vdiff <- ValueDifference

df2 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df2, w, idiff, vdiff),
               f_RT(df2, w, idiff, vdiff),
               f_AL(df2, w, idiff, vdiff),
               f_filt(df2, w, idiff, vdiff),
               f_roll(df2, w, idiff, vdiff)
               )
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval   cld
    f_G(df2, w, idiff, vdiff) 395.80227 412.05120 419.88554 413.55551 417.84907 479.47306   100     e
   f_RT(df2, w, idiff, vdiff) 154.43919 192.99473 193.10029 195.61031 197.95933 236.27244   100   c  
   f_AL(df2, w, idiff, vdiff) 233.30237 244.01664 249.75449 245.07001 248.51249 319.04956   100    d 
 f_filt(df2, w, idiff, vdiff)  21.53997  22.51582  25.38218  22.59477  23.56873  63.48320   100  b   
 f_roll(df2, w, idiff, vdiff)  14.26333  14.35543  16.99302  15.24879  15.45127  55.49886   100 a    

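In the end, we see a quite decent speed-up. Another neat thing about approaching it this way is that it stays just as efficient regardless of the window size (in particular when doing the rolling sum directly; using filter does slow down somewhat, although it is still fast).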

w <- 25 #Window
df3 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df3, w, idiff, vdiff),
               f_RT(df3, w, idiff, vdiff),
               f_AL(df3, w, idiff, vdiff),
               f_filt(df3, w, idiff, vdiff),
               f_roll(df3, w, idiff, vdiff)
               )
Unit: milliseconds
                         expr       min        lq      mean    median        uq       max neval   cld
    f_G(df3, w, idiff, vdiff) 487.65798 516.67700 537.54019 541.34459 551.52128 592.05720   100     e
   f_RT(df3, w, idiff, vdiff) 328.44934 366.76176 389.08534 401.39053 409.49376 518.94535   100    d 
   f_AL(df3, w, idiff, vdiff) 240.99006 258.66045 263.21317 260.09258 263.75917 319.02493   100   c  
 f_filt(df3, w, idiff, vdiff)  37.32291  37.41098  38.97167  37.47234  38.40989  79.51684   100  b   
 f_roll(df3, w, idiff, vdiff)  15.49264  15.52950  15.86283  15.55252  15.62852  19.77415   100 a    
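
One way to see why a rolling sum need not depend on the window size is to express it with cumulative sums, where every window total is the difference of two entries. The following base-R sketch is an assumption-labelled alternative, not how filter or roll_sum are implemented; the name rolling_any_cumsum is hypothetical.

```r
# Window totals via prefix sums: cost is independent of `window`.
# Illustrative sketch only; `rolling_any_cumsum` is a hypothetical name.
rolling_any_cumsum <- function(check, window) {
  n <- length(check)
  idx <- seq_len(n)
  # Zero-pad so that windows near the end are effectively truncated
  padded <- c(as.integer(check), rep(0L, window - 1L))
  cs <- c(0L, cumsum(padded))              # cs[k + 1] == sum(padded[1:k])
  totals <- cs[idx + window] - cs[idx]     # sum(padded[i:(i + window - 1)])
  as.integer(totals > 0)
}

rolling_any_cumsum(c(TRUE, FALSE, FALSE, FALSE, TRUE), window = 2)
# 1 0 0 1 1
```

For very long 0/1 vectors the integer cumulative sum stays well within range, since it can never exceed the vector length.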

Answer 1 (score: 1)

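One solution to this problem is to build two matrices of lagged columns and subtract one from the other. This uses vectorization in R and should be fast.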

df0 <- cbind(df$Price[-(1)][1:nrow(df)], sapply(2:Window, function(i)df$Price[-(1:i)][1:nrow(df)]))
df1 <- sapply((IndexDifference+1):(IndexDifference+Window), function(i)df$Price[-(1:i)][1:nrow(df)])
df$Signal <- as.numeric(apply((df0 - df1) > ValueDifference, 1, any, na.rm = TRUE))
df$Signal

Note that this does not give exactly the same result as your code, possibly because (i + 1):candidate_max[i] evaluates to 18:17 for i = 17, which is probably not what you want.
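
The boundary case for i = 17 can be reproduced in isolation. This is a minimal sketch that reuses the boundary definitions from the question with the row count hard-coded; the names n and idx are introduced here for illustration.

```r
n <- 20                 # nrow(df) in the question
IndexDifference <- 3
Window <- 6

base_rows <- 1:(n - IndexDifference)                        # 1..17
candidate_max <- pmin(base_rows + Window, n - IndexDifference)

i <- 17
idx <- (i + 1):candidate_max[i]   # 18:17 counts DOWN, giving 18 17
idx
idx + IndexDifference             # 21 20: row 21 does not exist, so Price[21] is NA
```

Because the colon operator counts downwards when its left operand is larger, the loop ends up comparing row 18 with a non-existent row 21 and row 17 with row 20, instead of skipping row 17 altogether.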

Answer 2 (score: 1)

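In your loop, most of the Price[i] - Price[i + IndexDifference] > ValueDifference comparisons are computed multiple times; in this case (see the code at the end) most comparisons are made 6 times: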

#    [i]  [i + IndexDifference]  [times calculated]   
#    Var1 Var2 Freq
#70     2    5    1
#88     3    6    2
#106    4    7    3
#124    5    8    4
#142    6    9    5
#160    7   10    6
#178    8   11    6
#196    9   12    6
#214   10   13    6
#232   11   14    6
#250   12   15    6
#268   13   16    6
#286   14   17    6
#304   15   18    6
#322   16   19    6
#340   17   20    6

Also, I guess, it is not only the repeated calculation itself that costs time, but also the repeated assignment to (and subsetting of) "data.frame"s.
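
If the repeated "data.frame" assignment is indeed part of the cost (the answer only guesses at this), one common way to sidestep it is to work on plain vectors inside the loop and assign to the data frame once at the end. A toy sketch, using a simplified comparison and a shortened price vector rather than the full rule from the question:

```r
# Toy data: first eight prices from the question
df <- data.frame(Price = c(1221, 1220, 1220, 1217, 1216, 1218, 1216, 1216))

price  <- df$Price                        # extract the column once
signal <- rep(NA_real_, length(price))    # plain vector for the result

for (i in seq_len(length(price) - 3)) {
  # Simplified rule: compare each row only with the row 3 below it
  signal[i] <- as.numeric(price[i] - price[i + 3] > 2)
}

df$Signal <- signal                       # single assignment back to the data frame
df$Signal
# 1 1 0 0 0 NA NA NA
```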

Instead, you could compute the differences and the comparisons once:

tmp = (df$Price[2:(nrow(df) - IndexDifference)] - 
      df$Price[(2 + IndexDifference):nrow(df)]) > ValueDifference

and apply any in a rolling fashion (minding your comment about not going out of bounds):

as.integer(sapply(seq_along(tmp), 
                  function(i) any(tmp[i:min(length(tmp), (i + (Window - 1)))])))
#[1] 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
                #and 4 values are left (rows 17:20 that cannot be 
                #calculated based on the conditions) to be added as `NA`
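
Padding the 16 computed values with NA so that they fit a 20-row column could look like the sketch below. The result vector is hard-coded from the output above, and the names res, n_rows and signal are introduced here for illustration.

```r
res <- c(1L, 0L, 0L, 0L, rep(1L, 12))   # the 16 values printed above
n_rows <- 20                            # nrow(df) in the question

# Rows 17:20 cannot be evaluated, so fill them with NA
signal <- c(res, rep(NA_integer_, n_rows - length(res)))
length(signal)       # 20
sum(is.na(signal))   # 4
```

The padded vector can then be assigned directly as df$Signal.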

The tabulation of the comparisons:

#re-calculated your 'base_rows' to not include row 17 as it exceeds the 'IndexDifference'
base_rows = 1:(nrow(df) - IndexDifference - 1L)  
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference) 

#set-up the tabulations for each comparison     
table_diffs = matrix(0L, 
                     base_rows[length(base_rows)] + 1L,
                     candidate_max[length(candidate_max)] + IndexDifference)
for(i in seq_along(base_rows)) { 
    ij = cbind((i + 1):candidate_max[i], ((i + 1):candidate_max[i]) + IndexDifference)
    table_diffs[ij] = table_diffs[ij] + 1L
}   
#format
subset(transform(as.data.frame(as.table(table_diffs)), 
                 Var1 = as.integer(Var1), 
                 Var2 = as.integer(Var2)), 
       Freq != 0L)