Question

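When working in R, I often create intermediate data frames that get saved to disk while the code runs. This lets me avoid recomputing them if I need to restart the script or it crashes. My code usually ends up littered with ugly if/else checks to see whether each intermediate data frame already exists.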

data <- NULL
pathToData <- "work/data.rds"
if(file.exists(pathToData)) {
   # load the previously calculated data
   data <- readRDS(pathToData)

} else { 
   # calculate the data
   data <- ...  
   saveRDS(data, pathToData)
}

Is there a better/simpler way? Ideally, this would be done in a way that is transparent in the code.

Answer 1

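One option is to wrap the ugly code in a function, and to wrap the intermediate steps in other functions. This has the advantage of making testing easier, and using functions rather than scripts is considered best practice for reproducible data analysis.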

calcData <- function(...) {
  #calculate the data
}

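lazyCalc <- function(fn, ...) {
  if (file.exists(fn)) {
    # load the previously calculated data
    data <- readRDS(fn)
  } else {
    # calculate the data and save it for the next run
    data <- calcData(...)
    saveRDS(data, fn)
  }
  return(data)
}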

Answer 2

One option is to use the knitr package with caching.

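You can create a complete knitr template file containing your script and other content, and mark the chunks you don't want to re-run as cached. They will then run again only when the code in that chunk changes.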

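You can also use the spin function from knitr on a script file; knitr will then look at specially formatted comments to set knitr options (everything else is treated essentially as a regular script file). I have not tried setting cache options with spin, but it may work for you.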
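As a rough illustration of the spin approach, here is a minimal sketch of a script that knitr could process. The file name analysis.R, the chunk labels, and the placeholder computation are all hypothetical; as with the answer itself, caching through spin is something to verify rather than take for granted. Lines starting with #' are treated as report text, and lines starting with #+ carry chunk options.

```r
#' Intermediate data frame, intended to be cached by knitr between runs.

#+ slow-step, cache=TRUE
# Placeholder for an expensive computation (hypothetical).
data <- data.frame(x = 1:10, y = (1:10)^2)

#+ summary-step
# No cache option here, so this chunk runs on every render.
summary(data$y)

# To render, run separately (assumes the file is saved as analysis.R):
# knitr::spin("analysis.R")
```

Because the special lines are ordinary R comments, the same file still runs unchanged with source() or Rscript.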

Answer 3

Drew Steen's answer was very close. I paired his function definition with his suggestion to use eval(). This is exactly what I was looking for.

cache <- function(cacheName, expr, cacheDir="work", clearCache=FALSE) {
    result <- NULL

    # create the cache directory, if necessary
    dir.create(path=cacheDir, showWarnings=FALSE)
    cacheFile <- file.path(cacheDir, sprintf("%s.rds", cacheName))

    # has the result already been cached?
    if(file.exists(cacheFile) && !clearCache) {
        result <- readRDS(cacheFile)

    # eval the expression and cache its result; expr is evaluated lazily,
    # so it only runs when we reach this branch
    } else {
        result <- eval(expr)
        saveRDS(result, cacheFile)
    }

    return(result)
}

This lets me cache a single function call...

result <- cache("foo", foo())

or a more complex expression/block of code...

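results <- cache("foo", {
   f <- foo()
   r <- f + 2
   # the value of the last expression is what gets cached;
   # return() has no function to return from inside this block
   r
})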
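To see the caching behaviour in action, the following sketch counts how often the expensive code actually runs. It includes a compact copy of the cache() helper so that it runs on its own; slowSquare, the calls counter, and the temporary cache directory are hypothetical names introduced only for this illustration.

```r
# Compact copy of cache() so this sketch is self-contained.
cache <- function(cacheName, expr, cacheDir = "work", clearCache = FALSE) {
    dir.create(cacheDir, showWarnings = FALSE)
    cacheFile <- file.path(cacheDir, paste0(cacheName, ".rds"))
    if (file.exists(cacheFile) && !clearCache) {
        return(readRDS(cacheFile))
    }
    result <- expr  # forces the lazily evaluated argument
    saveRDS(result, cacheFile)
    result
}

# Hypothetical "expensive" function that records how often it is called.
calls <- 0
slowSquare <- function(x) {
    calls <<- calls + 1
    x^2
}

# Use a throwaway directory so the demo starts from an empty cache.
demoDir <- file.path(tempdir(), "cache-demo")
unlink(demoDir, recursive = TRUE)

first  <- cache("sq", slowSquare(4), cacheDir = demoDir)  # computed and saved
second <- cache("sq", slowSquare(4), cacheDir = demoDir)  # read from disk
third  <- cache("sq", slowSquare(4), cacheDir = demoDir, clearCache = TRUE)  # recomputed

print(calls)  # slowSquare ran for `first` and `third` only
```

Note that the cache key is only the name you pass in: changing the expression without changing cacheName (or setting clearCache=TRUE) returns the stale result.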

How to avoid recomputing data in R
