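The following code takes a 20x1 data frame and, for each row i, checks whether any of the 6 rows below it (rows i + 1 to i + 6) exceeds the row 3 further down by more than 2 (e.g. (i + 1) - (i + 4) > 2). If so, a 1 is recorded in a newly created Signal column.
For example, for the first row, it checks whether:
...
If possible, I would like to find an alternative to the for loop. I run this template code on a large database, and the loop can take hours. Note that the loop code is somewhat convoluted in order to keep the loop from running out of bounds. Many thanks to @Gregor for a great deal of help in putting this together.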
#Data
df <- data.frame(Price = c( 1221, 1220, 1220, 1217, 1216, 1218 , 1216, 1216, 1217, 1220, 1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205))
#Inputs
Window = 6 # check up to this far below current row
IndexDifference = 3 # check row against another this far down
ValueDifference = 2 # for difference at least this big
#Define loop boundaries
base_rows = 1:(nrow(df) - IndexDifference) # can't check more than this
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference) # for a given base row, this is the maximum row to start checking against
#Make Signal variable
df$Signal = rep(NA, nrow(df)) #pre-allocate variable
for (i in seq_along(base_rows)) {
df$Signal[i] = as.numeric(
any(
df$Price[(i + 1):candidate_max[i]] -
df$Price[((i + 1):candidate_max[i]) + IndexDifference] > ValueDifference))}
Answer 0 (score: 2)
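This is a bit late, but in case it is useful.
I agree with @alexis_laz that more comparisons are computed than necessary. I think the idea can be taken further, because applying any in a rolling fashion also introduces unnecessary computation.
The key point is that a given row is always compared with one other specific row (the row 3 below it, in your example). Once we know whether the inequality holds for that row, every row whose window contains it should get the value 1 (TRUE).
The useful shortcut is this: if the inequality holds for row j, which makes row i TRUE, and row j is also inside the window of row i+1, then row i+1 is TRUE as well (without needing to know the state of the other points in the window). What I am getting at is that we do not need to evaluate any over each row's window. If we know how many TRUEs are in the window of row i, then for row i+1 we only need to determine whether the point leaving the window is TRUE and whether the point entering it is TRUE. Essentially we filter the vector with a box of width Window, and then check which entries have at least one TRUE in their window (this can be done in a single pass, but let's ignore that since the extra time is not significant).
With a rolling sum we can compute this efficiently by keeping a running count, adding and removing the points that enter and leave the window. This is where @alexis_laz's observation comes in: the state of the entering and leaving points can be precomputed.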
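As a rough illustration of the running-count idea (not code from any of the answers), the sketch below slides a window over a precomputed logical vector and updates a single counter per step. The function name rolling_any_count and the toy input are assumptions made for this example only.

```r
# Hypothetical helper: rolling "any" via a running count.
# out[i] is 1 if any of check[i], ..., check[i + window - 1] is TRUE;
# positions past the end of `check` are treated as FALSE.
rolling_any_count <- function(check, window) {
  n <- length(check)
  # zero-pad so the window can run off the end of the vector
  padded <- c(as.integer(check), rep(0L, window))
  out <- integer(n)
  # number of TRUEs in the first window
  count <- sum(padded[seq_len(window)])
  for (i in seq_len(n)) {
    out[i] <- as.integer(count > 0L)
    # slide by one: drop the point leaving, add the point entering
    count <- count - padded[i] + padded[i + window]
  }
  out
}

check <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
rolling_any_count(check, window = 3)
# 1 0 1 1 1 0 0
```

Each step costs one subtraction and one addition regardless of the window width, which is what a rolling-sum routine does internally in compiled code.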
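To make things more concrete, here is some code. First, let me reproduce your original loop, @Richard Telford's answer, and @alexis_laz's answer, wrapped as functions (rewritten slightly, mostly for my own convenience so that the output formats match, and hopefully without introducing any bugs):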
f_G <- function(x, window, idiff, valdiff){
base_rows = 1:(NROW(x) - idiff - 1) # can't check more than this
candidate_max = pmin(base_rows + window, NROW(x) - idiff) # maximum row to start checking against
out = rep(0, NROW(x)) #pre-allocate variable
for (i in seq_along(base_rows)) {
out[i] = as.numeric(any(x[(i + 1):candidate_max[i]]
- x[((i + 1):candidate_max[i]) + idiff] > valdiff))}
return(out)
}
f_RT <- function(x, window, idiff, valdiff){
x0 <- cbind(x[-(1)][1:NROW(x)], sapply(2:window,
function(i)x[-(1:i)][1:NROW(x)]))
x1 <- sapply((idiff+1):(idiff+window),
function(i)x[-(1:i)][1:NROW(x)])
out <- as.numeric(apply((x0 - x1) > valdiff, 1, any, na.rm = TRUE))
return(out)
}
f_AL <- function(x, window, idiff, valdiff){
check = (x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff
check <- c(check, rep(FALSE, idiff+1))
out <- as.integer(sapply(seq_along(check),
function(i) any(check[i:min(length(check), (i + (window - 1)))])))
return(out)
}
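Next, here are two functions that compute the rolling sum described above on the vector of precomputed differences (as @alexis_laz suggested). The first uses the filter function, while the second uses roll_sum from the RcppRoll package.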
f_filt <- function(x, window, idiff, valdiff){
## calculate idiff differences once
check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
## extend series to filter
check <- c(check, rep(0, window+idiff))
## reverse series due to filter using "past" values
## (stats::filter is named explicitly so that a loaded dplyr does not mask it)
ffilt <- rev(stats::filter(rev(check), rep(1, window), sides=1))
## check if at least one
out <- ifelse(na.omit(ffilt) > 0, 1, 0)
return(out)
}
library(RcppRoll)
f_roll <- function(x, window, idiff, valdiff){
## calculate idiff differences once
check = as.integer((x[2:(NROW(x) - idiff)] - x[(2 + idiff):NROW(x)]) > valdiff)
## extend series to filter
check <- c(check, rep(0, window+idiff))
## rolling window sum
froll <- roll_sum(check, n=window, align="right")
out <- ifelse(froll > 0, 1, 0)
return(out)
}
As a quick check, we can test that all the functions give the same answer:
f_G(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_RT(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_AL(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_filt(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
f_roll(df$Price, Window, IndexDifference, ValueDifference)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
Now let's benchmark them. I will also increase the number of rows to test on.
library(microbenchmark)
w <- Window
idiff <- IndexDifference
vdiff <- ValueDifference
df2 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df2, w, idiff, vdiff),
f_RT(df2, w, idiff, vdiff),
f_AL(df2, w, idiff, vdiff),
f_filt(df2, w, idiff, vdiff),
f_roll(df2, w, idiff, vdiff)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
f_G(df2, w, idiff, vdiff) 395.80227 412.05120 419.88554 413.55551 417.84907 479.47306 100 e
f_RT(df2, w, idiff, vdiff) 154.43919 192.99473 193.10029 195.61031 197.95933 236.27244 100 c
f_AL(df2, w, idiff, vdiff) 233.30237 244.01664 249.75449 245.07001 248.51249 319.04956 100 d
f_filt(df2, w, idiff, vdiff) 21.53997 22.51582 25.38218 22.59477 23.56873 63.48320 100 b
f_roll(df2, w, idiff, vdiff) 14.26333 14.35543 16.99302 15.24879 15.45127 55.49886 100 a
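In the end, we get a pretty decent speed-up. Another neat thing about this approach is that it stays just as efficient regardless of window size (in particular when doing the rolling sum directly; filter does slow down a little, though even then it is still fast).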
w <- 25 #Window
df3 <- rep(df$Price, 5000) #100,000 entries
microbenchmark(f_G(df3, w, idiff, vdiff),
f_RT(df3, w, idiff, vdiff),
f_AL(df3, w, idiff, vdiff),
f_filt(df3, w, idiff, vdiff),
f_roll(df3, w, idiff, vdiff)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
f_G(df3, w, idiff, vdiff) 487.65798 516.67700 537.54019 541.34459 551.52128 592.05720 100 e
f_RT(df3, w, idiff, vdiff) 328.44934 366.76176 389.08534 401.39053 409.49376 518.94535 100 d
f_AL(df3, w, idiff, vdiff) 240.99006 258.66045 263.21317 260.09258 263.75917 319.02493 100 c
f_filt(df3, w, idiff, vdiff) 37.32291 37.41098 38.97167 37.47234 38.40989 79.51684 100 b
f_roll(df3, w, idiff, vdiff) 15.49264 15.52950 15.86283 15.55252 15.62852 19.77415 100 a
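For readers without RcppRoll, a similar windowed sum can be obtained in base R from a cumulative sum. This is a sketch of a swapped-in technique, not something taken from the answers in this thread; f_cumsum is a hypothetical name, and the function assumes x has more than idiff + 1 entries. It has not been benchmarked here.

```r
# Hypothetical base-R variant: windowed sum from differences of a cumulative sum.
f_cumsum <- function(x, window, idiff, valdiff) {
  n <- NROW(x)
  # compare each row j (starting at row 2) with row j + idiff, once
  check <- as.integer((x[2:(n - idiff)] - x[(2 + idiff):n]) > valdiff)
  # zero-pad so that every row has a full window
  check <- c(check, rep(0L, window + idiff))
  cs <- c(0L, cumsum(check))
  # sum(check[i:(i + window - 1)]) equals cs[i + window] - cs[i]
  idx <- seq_len(n)
  as.numeric(cs[idx + window] - cs[idx] > 0)
}

Price <- c(1221, 1220, 1220, 1217, 1216, 1218, 1216, 1216, 1217, 1220,
           1219, 1218, 1220, 1216, 1217, 1218, 1218, 1207, 1206, 1205)
f_cumsum(Price, window = 6, idiff = 3, valdiff = 2)
# 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
```

Like the rolling sum, the work per row is constant, so the cost should not grow with the window width.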
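Answer 1 (score: 1)
One solution to this problem is to build two matrices of lagged columns and subtract one from the other. This uses R's vectorisation and should be fast.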
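To show what "lagged columns" means here, the toy sketch below builds the two matrices for a six-element vector with a window of 2 and an index difference of 2. The helper lag_by and the toy values are assumptions for illustration; the subsetting idiom is the same one the answer uses.

```r
# Hypothetical helper: shift x up by k positions, padding the tail with NA.
lag_by <- function(x, k) x[-seq_len(k)][seq_along(x)]

x <- c(5, 3, 8, 1, 9, 2)
window <- 2; idiff <- 2; valdiff <- 1

# rows i + 1 ... i + window, one column per offset
x0 <- sapply(1:window, function(k) lag_by(x, k))
# the same rows shifted a further idiff positions down
x1 <- sapply((idiff + 1):(idiff + window), function(k) lag_by(x, k))

# a row is flagged if any of its window comparisons exceeds valdiff
signal <- as.numeric(apply((x0 - x1) > valdiff, 1, any, na.rm = TRUE))
signal
# 1 0 0 0 0 0
```

Row 1 is flagged because x[2] - x[4] = 3 - 1 = 2, which is greater than 1; comparisons that fall off the end become NA and are ignored by na.rm = TRUE.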
df0 <- cbind(df$Price[-(1)][1:nrow(df)], sapply(2:Window, function(i)df$Price[-(1:i)][1:nrow(df)]))
df1 <- sapply((IndexDifference+1):(IndexDifference+Window), function(i)df$Price[-(1:i)][1:nrow(df)])
df$Signal <- as.numeric(apply((df0 - df1) > ValueDifference, 1, any, na.rm = TRUE))
df$Signal
Note that this does not give exactly the same result as your code, possibly because (i + 1):candidate_max[i] evaluates to the descending sequence 18 17 when i = 17, which is probably not what you want.
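The boundary issue can be reproduced in isolation. The sketch below recomputes the question's loop bounds for a 20-row input and shows that R's colon operator counts downwards when the start exceeds the end; the seq_len-based alternative at the end is one possible guard, offered as an assumption rather than as the fix any answer proposes.

```r
n <- 20; Window <- 6; IndexDifference <- 3

# loop bounds exactly as defined in the question
base_rows <- 1:(n - IndexDifference)
candidate_max <- pmin(base_rows + Window, n - IndexDifference)

i <- 17
candidate_max[i]              # 17, i.e. smaller than i + 1
(i + 1):candidate_max[i]      # 18 17: the colon operator counts down

# one possible guard: build the range from a non-negative length instead
safe_range <- i + seq_len(max(candidate_max[i] - i, 0))
length(safe_range)            # 0, so nothing is checked for row 17
```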
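Answer 2 (score: 1)
In your loop, most of the Price[i] - Price[i + IndexDifference] > ValueDifference comparisons are computed more than once; in this case (see the code at the end) most comparisons are made 6 times: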
# [i] [i + IndexDifference] [times calculated]
# Var1 Var2 Freq
#70 2 5 1
#88 3 6 2
#106 4 7 3
#124 5 8 4
#142 6 9 5
#160 7 10 6
#178 8 11 6
#196 9 12 6
#214 10 13 6
#232 11 14 6
#250 12 15 6
#268 13 16 6
#286 14 17 6
#304 15 18 6
#322 16 19 6
#340 17 20 6
Also, I guess, the cost is not only the repeated computation itself but also the repeated assignment to (and subsetting of) "data.frame"s.
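To put a number on the repetition, the short sketch below counts how many comparisons the loop performs for the 20-row example and how many distinct comparisons exist. It uses the adjusted base_rows (rows 1 to 16) from the tabulation code at the end of this answer; the variable names loop_comparisons and distinct_comparisons are made up for this illustration.

```r
n <- 20; Window <- 6; IndexDifference <- 3

base_rows <- 1:(n - IndexDifference - 1)
candidate_max <- pmin(base_rows + Window, n - IndexDifference)

# the loop compares rows (i + 1):candidate_max[i] for every base row i
loop_comparisons <- sum(candidate_max - base_rows)

# each row from 2 to n - IndexDifference only needs to be compared once
distinct_comparisons <- length(2:(n - IndexDifference))

c(loop = loop_comparisons, distinct = distinct_comparisons)
# loop 81, distinct 16
```

The loop total matches the sum of the Freq column in the table above (1 + 2 + 3 + 4 + 5 + 11 * 6 = 81).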
Instead, you can compute the differences and the comparisons once:
tmp = (df$Price[2:(nrow(df) - IndexDifference)] -
df$Price[(2 + IndexDifference):nrow(df)]) > ValueDifference
and then apply any in a rolling fashion (bearing in mind your comment about not going out of bounds):
as.integer(sapply(seq_along(tmp),
function(i) any(tmp[i:min(length(tmp), (i + (Window - 1)))])))
#[1] 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
#and 4 values are left (rows 17:20 that cannot be
#calculated based on the conditions) to be added as `NA`
Tabulation of the comparisons:
#re-calculated your 'base_rows' to not include row 17 as it exceeds the 'IndexDifference'
base_rows = 1:(nrow(df) - IndexDifference - 1L)
candidate_max = pmin(base_rows + Window, nrow(df) - IndexDifference)
#set-up the tabulations for each comparison
table_diffs = matrix(0L,
base_rows[length(base_rows)] + 1L,
candidate_max[length(candidate_max)] + IndexDifference)
for(i in seq_along(base_rows)) {
ij = cbind((i + 1):candidate_max[i], ((i + 1):candidate_max[i]) + IndexDifference)
table_diffs[ij] = table_diffs[ij] + 1L
}
#format
subset(transform(as.data.frame(as.table(table_diffs)),
Var1 = as.integer(Var1),
Var2 = as.integer(Var2)),
Freq != 0L)