Question

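I am trying to predict on a test data set from a model in R. Some factor levels are missing, so I want to skip or replace those rows as I build the output. This is what I came up with: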

safe.predict = function(x){
  output = tryCatch({
    predict(best.model, test.data.ffdf[x,])
  }, error = function(e) {
    0.50
  })
  return(output)
}#end safe.predict

output = sapply(1:nrow(test.data.ffdf), safe.predict)

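This way, if there is an error (i.e. an additional factor level), I can replace the value with 0.5 or NA, or even fall back to another model. The problem is that it takes forever. My data set has 6M rows, and based on timings with smaller data sets it looks like the run would take about 7 days. Running the same thing with plain predict() takes under an hour, but without any error catching.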

So what am I doing wrong? The approach looks vectorized, yet it runs slower than a for() loop.

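Edit: Removing the values from the data set is difficult, because at this point I have access to the model but not necessarily to the training data set. In addition, the test set may contain factor levels that do not appear in any available training set.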

Answer 1

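The problem is not tryCatch, which adds only minimal overhead. To illustrate, here is a function that measures the time (in seconds) needed to generate n errors: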

FUN = function(n) 
    system.time(replicate(n, tryCatch(stop(), error=function(...) NA)))[[3]]

> sapply(10^(0:5), FUN)
## [1] 0.001 0.001 0.005 0.055 0.554 5.642

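You can see that generating 10^5 errors takes roughly 5.6 seconds. Instead, it is predict that sometimes takes a long time to perform its calculation. Perhaps the long calculation eventually ends in an error, perhaps not. The best bet is to understand predict better and improve its implementation or, if necessary, to identify and eliminate the rows that trigger the long-running calculation.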
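If the errors come from factor levels the model has never seen, one way to "identify and eliminate" the offending rows is to check them up front and call predict once on the rest. The following is only a sketch under assumptions: it uses a toy glm (the question does not say what best.model is), an ordinary data.frame rather than an ffdf, and a hypothetical helper name predict_known_levels. It relies on the xlevels component that lm and glm fits store.

```r
# Toy data; the model type (glm) and every name here are illustrative assumptions.
set.seed(1)
train <- data.frame(
  y = rbinom(200, 1, 0.5),
  g = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  x = rnorm(200)
)
fit <- glm(y ~ g + x, data = train, family = binomial)

test <- data.frame(
  g = factor(c("a", "b", "z", "c", "z")),  # "z" never appeared in training
  x = rnorm(5)
)

# Hypothetical helper: predict only rows whose factor levels were seen at
# fit time (stored in model$xlevels) and fill the others with a fallback.
predict_known_levels <- function(model, newdata, fallback = 0.5) {
  ok <- rep(TRUE, nrow(newdata))
  for (v in names(model$xlevels)) {
    ok <- ok & (as.character(newdata[[v]]) %in% model$xlevels[[v]])
  }
  out <- rep(fallback, nrow(newdata))
  if (any(ok)) {
    known <- droplevels(newdata[ok, , drop = FALSE])
    out[ok] <- predict(model, known, type = "response")
  }
  out
}

pred <- predict_known_levels(fit, test)
print(pred)
```

Rows 3 and 5 get the fallback value and the rest come from a single vectorized predict call. This assumes the factor columns appear in the formula under their own names; a term such as factor(g) would need extra handling.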

Answer 2

OK, so I have been working on this and found that the sapply call is actually the problem, not the tryCatch function:

tries = 1000

time1 = proc.time()[3]
test1 = sapply(1:tries, safe.predict) #this way is very very slow
time2 = proc.time()[3]
print(paste0('Runtime: ',(time2-time1)/60))

[1] "Runtime: 1.76163333333333"

time1 = proc.time()[3]
test2 = safe.predict(1:tries)# very fast
time2 = proc.time()[3]
print(paste0('Runtime: ',(time2-time1)/60))

[1] "Runtime: 0.00740000000000028"

#be sure that we are coming to the same answers
sum(test1 == test2) / length(test1)

[1] 1
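One caveat with the single vectorized call: if any row in the batch makes predict fail, tryCatch returns the scalar 0.50 for the whole batch instead of one value per row. A possible middle ground, sketched below, is to predict in chunks and drop down to row-by-row calls only inside a chunk that fails. The helper predict_chunked, the chunk size, and the toy lm are all assumptions made for illustration; they are not part of the original code.

```r
# Hypothetical helper: predict_fun(idx) must return one prediction per index.
predict_chunked <- function(predict_fun, n, chunk_size = 10000,
                            fallback = NA_real_) {
  if (n == 0) return(numeric(0))
  out <- rep(fallback, n)
  for (s in seq(1, n, by = chunk_size)) {
    idx <- s:min(s + chunk_size - 1, n)
    # Fast path: one predict call for the whole chunk.
    res <- tryCatch(predict_fun(idx), error = function(e) NULL)
    if (!is.null(res)) {
      out[idx] <- res
    } else {
      # Slow path: only for chunks that contain at least one bad row.
      for (i in idx) {
        out[i] <- tryCatch(predict_fun(i), error = function(e) fallback)
      }
    }
  }
  out
}

# Toy model and test set; row 7 carries a level the model has never seen.
set.seed(7)
train <- data.frame(g = factor(sample(c("a", "b"), 100, replace = TRUE)),
                    x = rnorm(100))
train$y <- as.numeric(train$g) + train$x + rnorm(100)
fit <- lm(y ~ g + x, data = train)
test <- data.frame(g = factor(c(rep("a", 6), "z", rep("b", 5))),
                   x = rnorm(12))

pred <- predict_chunked(
  function(idx) unname(predict(fit, test[idx, , drop = FALSE])),
  n = nrow(test), chunk_size = 4
)
print(pred)
```

Here only the second chunk (rows 5 to 8) is retried row by row, and only row 7 ends up with the fallback. If bad rows are frequent, most chunks take the slow path and the benefit shrinks, so the chunk size is worth tuning.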

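I can see now that using sapply is not really necessary, but I thought this was the whole point of sapply. If anyone has a good explanation for the slowdown, I would still like to see it in a comment or another answer.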
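A likely explanation for the slowdown: sapply is a loop in disguise, so it calls predict once per row, and every call pays a fixed cost for building the model frame and design matrix from the new data (plus, presumably, the cost of pulling a single row out of the ffdf). A single call pays that cost once. The sketch below illustrates this on a toy lm; the data and model are made up, and the actual timings will vary by machine.

```r
# Toy data and model, for illustration only.
set.seed(42)
n <- 500
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

# One predict call per row: the per-call overhead is paid n times.
t_rows <- system.time(
  p_rows <- sapply(seq_len(n), function(i) predict(fit, d[i, , drop = FALSE]))
)[["elapsed"]]

# One predict call for all rows: the overhead is paid once.
t_once <- system.time(p_once <- predict(fit, d))[["elapsed"]]

# Same predictions either way; only the elapsed time differs.
print(all.equal(unname(p_rows), unname(p_once)))
print(c(row_by_row = t_rows, single_call = t_once))
```

The speed of R's apply family comes from calling vectorized functions on whole vectors, not from sapply itself, which still evaluates the supplied function once per element.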

How can I speed up tryCatch in R?
