How can I limit execution time but keep the output in R?

Asked: 2017-09-26 14:11:54

Tags: r machine-learning time limit xgboost

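I am trying to limit the execution time of an analysis, but I want to keep whatever work the analysis has already completed. In my case I am running xgb.cv (from the xgboost R package), and I would like to keep all iterations completed up to the point where the analysis reaches 10 seconds (or "n" seconds/minutes/hours).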

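I have tried the approach mentioned in this thread, but it simply stops once 10 seconds have elapsed and does not keep the iterations completed before that.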

Here is my code:

require(xgboost)
require(R.utils)

data(iris)
train.model <- model.matrix(Sepal.Length~., iris)

dtrain <- xgb.DMatrix(data=train.model, label=iris$Sepal.Length)

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(sum((log(preds) -  log(labels))^2)/length(labels))
  return(list(metric = "error", value = err))}

xgb_grid = list(eta = 0.05, max_depth = 5, subsample = 0.7, gamma = 0.3,
  min_child_weight = 1)

fit_boost <- tryCatch(
            expr = {evalWithTimeout({xgb.cv(data  = dtrain,
                  nrounds     = 10000,
                  objective   = "reg:linear",
                  eval_metric = evalerror, 
                  early_stopping_rounds = 300,
                  print_every_n = 100,
                  params = xgb_grid,
                  colsample_bytree = 0.7, 
                  nfold = 5,
                  prediction = TRUE,
                  maximize = FALSE
                  )}, 
                  timeout = 10)
                  },                                        
            TimeoutException = function(ex) cat("Timeout. Skipping.\n"))

Output:

#Error in dim.xgb.DMatrix(x) : reached CPU time limit

Thanks!

1 Answer:

Answer 0: (score: 1)

Edit - slightly closer to what you want:

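Wrap the whole thing in R's capture.output() function. This stores all of the evaluation output as an R object (a character vector with one element per printed line). Again, I think you are looking for something more than this, but it is at least local and malleable. Syntax: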

fit_boost <- capture.output(tryCatch(expr = {evalWithTimeout({...})}))
> fit_boost
 [1] "[1]\ttrain-error:2.033160+0.006109\ttest-error:2.034180+0.017467 "  ...
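
To see why the captured lines survive, here is a minimal sketch in base R. It assumes a toy function, noisy_job(), which is hypothetical and only stands in for xgb.cv, and it simulates the timeout with stop() rather than a real time limit. The point it illustrates is that tryCatch() sits inside capture.output(), so the error is handled before capture.output() returns.

```r
# Toy stand-in for a long-running job that prints one log line per round.
# noisy_job() is hypothetical; it is not part of xgboost.
noisy_job <- function(n_rounds, fail_at) {
  for (i in seq_len(n_rounds)) {
    cat(sprintf("[%d]\ttrain-error:%.6f\n", i, 1 / i))
    if (i == fail_at) stop("simulated timeout")  # mimics the timeout error
  }
}

# tryCatch() is evaluated inside capture.output(), so the error is handled
# there and the lines printed before the failure are kept.
log_lines <- capture.output(
  tryCatch(
    noisy_job(n_rounds = 10, fail_at = 3),
    error = function(e) cat("Timeout. Skipping.\n")
  )
)

print(log_lines)
```

If the tryCatch() were placed outside capture.output() instead, the error would abort the capture and the assignment to log_lines would never happen.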

Original answer:

You can also use sink. Just add this line before starting the cross-validation:

sink("evaluationLog.txt")
fit_boost <- tryCatch(
expr = {evalWithTimeout({xgb.cv(data  = dtrain,
                              nrounds     = 10000,
                              objective   = "reg:linear",
                              eval_metric = evalerror, 
                              early_stopping_rounds = 300,
                              print_every_n = 100,
                              params = xgb_grid,
                              colsample_bytree = 0.7, 
                              nfold = 5,
                              prediction = TRUE,
                              maximize = FALSE
)}, 
timeout = 10)
},                                        
TimeoutException = function(ex) cat("Timeout. Skipping.\n"))
sink()

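The sink() at the end would normally return output to the console, but in this case it does not, because an error is thrown before it is reached. Once you have run it, though, you can open evaluationLog.txt and voilà: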

[1] train-error:2.033217+0.003705   test-error:2.032427+0.012808 
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 300 rounds.

[101]   train-error:0.045297+0.000396   test-error:0.060047+0.001849 
[201]   train-error:0.042085+0.000852   test-error:0.059798+0.002382 
[301]   train-error:0.041117+0.001032   test-error:0.059733+0.002701 
[401]   train-error:0.040340+0.001170   test-error:0.059481+0.002973 
[501]   train-error:0.039988+0.001145   test-error:0.059469+0.002929 
[601]   train-error:0.039698+0.001028   test-error:0.059416+0.003018 
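
Because the closing sink() is skipped when the error is thrown, console output can stay diverted to the file afterwards. One possible way around that, sketched below in base R, is to register the closing sink() with on.exit() so that it runs whether or not the code fails. The helper with_log_file() is hypothetical, and the timeout is again simulated with stop().

```r
# Hypothetical helper: divert printed output to a file while `expr` runs and
# always restore the previous output destination, even if `expr` fails.
with_log_file <- function(path, expr) {
  sink(path)
  on.exit(sink(), add = TRUE)  # runs on normal exit and on error
  force(expr)                  # lazy evaluation: expr runs here, inside the sink
}

log_path <- tempfile(fileext = ".txt")
depth_before <- sink.number()  # number of active sinks before we start

result <- tryCatch(
  with_log_file(log_path, {
    cat("[1]\ttrain-error:2.033217\n")
    cat("[101]\ttrain-error:0.045297\n")
    stop("simulated timeout")  # stands in for the real timeout error
  }),
  error = function(e) conditionMessage(e)
)

# The sink opened by the helper has been closed again, and the file holds
# everything printed before the failure.
print(sink.number() == depth_before)
print(readLines(log_path))
```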

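Of course, this is not perfect. I imagine you want to do some manipulation on these values, and this is not the best format for that. Still, converting it into something more manageable is not a tall order. I have not yet found a way to save the actual xgb.cv$evaluation_log object before the timeout. This is a really good question.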
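
As a rough illustration of that conversion, the sketch below parses log lines into a data frame using base R only. It assumes the lines look like the log shown above, with each metric printed as mean+std across folds; parse_cv_log() and its column names are hypothetical and not part of xgboost. With the sink approach, the input would come from readLines("evaluationLog.txt"); with capture.output(), it is the returned character vector.

```r
# Sample lines in the same format as the log shown above.
log_lines <- c(
  "[1]\ttrain-error:2.033217+0.003705\ttest-error:2.032427+0.012808 ",
  "Multiple eval metrics are present. Will use test_error for early stopping.",
  "[101]\ttrain-error:0.045297+0.000396\ttest-error:0.060047+0.001849 ",
  "[201]\ttrain-error:0.042085+0.000852\ttest-error:0.059798+0.002382 "
)

# parse_cv_log() is a hypothetical helper, not part of xgboost.
parse_cv_log <- function(lines) {
  pattern <- paste0(
    "^\\[([0-9]+)\\][[:space:]]+train-error:([0-9.]+)\\+([0-9.]+)",
    "[[:space:]]+test-error:([0-9.]+)\\+([0-9.]+)[[:space:]]*$"
  )
  hits <- regmatches(lines, regexec(pattern, lines))
  hits <- hits[lengths(hits) == 6]  # full match + 5 groups; drops other lines
  if (length(hits) == 0) return(data.frame())
  fields <- do.call(rbind, lapply(hits, function(m) as.numeric(m[-1])))
  out <- as.data.frame(fields)
  names(out) <- c("iter", "train_error_mean", "train_error_std",
                  "test_error_mean", "test_error_std")
  out
}

eval_log <- parse_cv_log(log_lines)
print(eval_log)
```

This only recovers the rounds that were actually printed, so with print_every_n = 100 it yields one row per hundred rounds rather than the full evaluation log.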