Question

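TL;DR

My question: From within an R session, is there some way to use knitr's cached results to "fast-forward" to the environment (i.e. the set of objects) available in a given code chunk, in the same sense that knit() itself does?

Setup:

knitr's built-in caching of code chunks is one of its killer features.

It is especially useful when some chunks contain time-consuming computations. Unless they (or a chunk they depend on) are altered, the computations only need to be performed the first time the document is knit: on all subsequent calls to knit, the objects created by the chunk are simply loaded from the cache.

Here is a minimal example, a file called "lotsOfComps.Rnw":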

\documentclass{article}
\begin{document}

The calculations in this chunk take a looooong time.

<<slowChunk, cache=TRUE>>=
Sys.sleep(30)  ## Stands in for some time-consuming computation
x <- sample(1:10, size=2)
@

I wish I could `fast-forward' to this chunk, to view the cached value of 
\texttt{x}

<<interestingChunk>>=
y <- prod(x)^2
y
@

\end{document}

Time required to knit and TeXify "lotsOfComps.Rnw":

## First time
system.time(knit2pdf("lotsOfComps.Rnw"))
##   user  system elapsed
##   0.07    0.02   31.81

## Second (and subsequent) runs
system.time(knit2pdf("lotsOfComps.Rnw"))
##   user  system elapsed
##   0.03    0.02    1.28

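My question:

From within an R session, is there some way to use knitr's cached results to "fast-forward" to the environment (i.e. the set of objects) available in a given code chunk, in the same sense that knit() itself does?

Doing purl("lotsOfComps.Rnw") and then running the code in "lotsOfComps.R" doesn't work, because all of the objects along the way have to be recomputed.

Ideally, it would be possible to do something like this, ending up in the environment that exists at the beginning of <<interestingChunk>>=: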

spin("lotsOfComps.Rnw", chunk="interestingChunk")
ls()
# [1] "x"
x
# [1] 3 8

Since spin() isn't (yet?) available, what is the best way to get the equivalent result?

Answer 1

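This has got to be one of the ugliest kludges I've written in a while...

The basic idea is to scan the .Rnw file for chunks, extract their names, detect which ones are cached, and then determine which ones need to be loaded. Once that is done, we step through each chunk name that needs to be loaded, detect the database name in the cache folder, and load it using lazyLoad. After all of the chunks are loaded, we need to force evaluation, because lazyLoad only creates promises. It's ugly, and I'm sure there are some bugs, but it seems to work for the simple example you gave and for some other minimal examples I created. This assumes that the .Rnw file is in the current working directory and that the cache lives in the "cache" subfolder...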

load_cache_until <- function(file, chunk, envir = parent.frame()){
    require(knitr)

    # kludge to detect chunk names, which come before the chunk of
    # interest, and which are cached... there has to be a nicer way...
    text <- readLines(file)
    chunks <- grep("^<<.*>>=", text, value = TRUE)
    chunknames <- gsub("^<<([^,>]*)[,>]*.*", "\\1", chunks)
    #detect unnamed chunks
    tmp <- grep("^\\s*$", chunknames)
    chunknames[tmp] <- paste0("unnamed-chunk-", seq_along(tmp))
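    id <- which(chunk == chunknames)
    # Fail early if the requested chunk is missing or its label is ambiguous
    stopifnot(length(id) == 1)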
    previouschunks <- chunknames[seq_len(id - 1)]
    cachedchunks <- chunknames[grep("cache\\s*=\\s*T", chunks)]

    # These are the names of the chunks we want to load
    extractchunks <- cachedchunks[cachedchunks %in% previouschunks]

    oldls <- ls(envir, all = TRUE)
    # For each chunk...
    for(ch in extractchunks){   
        # Detect the file name of the database...
        pat <- paste0("^", ch, ".*\\.rdb")
        val <- sub("\\.rdb$", "", dir("cache", pattern = pat))
        # Lazy load the database
        lazyLoad(file.path("cache", val), envir = envir)
    }
    # Detect the new objects added
    newls <- ls(envir, all = TRUE)
    # Force evaluation...  There is probably a better way
    # to do this too...
    lapply(setdiff(newls, oldls), get, envir = envir)

    invisible()

}

load_cache_until("lotsOfComps.Rnw", "interestingChunk")

Making the code more robust is left as an exercise for the reader.
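As a small step in that direction, the label-parsing part can be pulled out and checked on its own. This is only a sketch: rnw_chunks() is a hypothetical helper (not part of knitr), it assumes the label is the first comma-separated field of the chunk header, and it only sees cache=TRUE written in the header itself, not a global opts_chunk$set(cache = TRUE).

```r
# Hypothetical helper: list chunk labels and per-chunk cache flags
# from the lines of an Rnw file.
rnw_chunks <- function(lines) {
  headers <- grep("^<<.*>>=", lines, value = TRUE)
  # Everything between "<<" and ">>=" is the option string
  opts <- sub("^<<(.*)>>=.*$", "\\1", headers)
  # Assumption: the label is the first comma-separated field,
  # unless that field is empty or a key=value pair
  first <- trimws(sub(",.*$", "", opts))
  unnamed <- first == "" | grepl("=", first, fixed = TRUE)
  first[unnamed] <- paste0("unnamed-chunk-", seq_len(sum(unnamed)))
  data.frame(
    label  = first,
    cached = grepl("cache\\s*=\\s*(TRUE|T)\\b", opts, perl = TRUE),
    stringsAsFactors = FALSE
  )
}

lines <- c("<<slowChunk, cache=TRUE>>=", "x <- 1", "@",
           "<<interestingChunk>>=", "y <- x", "@")
info <- rnw_chunks(lines)

# Cached chunks that come before the chunk of interest
before <- info[seq_len(match("interestingChunk", info$label) - 1), ]
before$label[before$cached]
```

Keeping the parsing separate from the loading makes it easier to see which of the two steps misbehaves on a real document.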

Answer 2

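Here is a solution that is still a little awkward, but it works. The idea is to add a chunk option named, say, mute, which is NULL by default but can also take an R expression, e.g. mute_later() below. When knitr evaluates the chunk options, mute_later() is evaluated and returns NULL; in the meantime, it has side effects on opts_chunk (it sets global chunk options such as eval = FALSE).

Now all you need to do is put mute=mute_later() in the chunk after which you want to skip the remaining chunks; e.g. you can move this option from example-a to example-b. Because mute_later() returns NULL, which happens to be the default value of the mute option, the cache will not be invalidated even if you move this option around.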

\documentclass{article}
\begin{document}

<<setup, include=FALSE, cache=FALSE>>=
rm(list = ls(all.names = TRUE), envir = globalenv())
opts_chunk$set(cache = TRUE) # enable cache to make it faster
opts_chunk$set(eval = TRUE, echo = TRUE, include = TRUE)

# set global options to mute later chunks
mute_later = function() {
  opts_chunk$set(cache = FALSE, eval = FALSE, echo = FALSE, include = FALSE)
  NULL
}
# a global option mute=NULL so that using mute_later() will not break cache
opts_chunk$set(mute = NULL)
@

<<example-a, mute=mute_later()>>=
x = rnorm(4)
Sys.sleep(5)
@

<<example-b>>=
y = rpois(10,5)
Sys.sleep(5)
@

<<example-c>>=
z = 1:10
Sys.sleep(3)
@

\end{document}

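It is awkward in the sense that you have to cut and paste ", mute=mute_later()" around. Ideally, you should just set the chunk label, as in the gist I wrote for Barry.

The reason my original gist did not work is that chunk hooks are ignored when chunks are cached. The second time you knit() the file, the checkpoint chunk hook for example-a is skipped, so eval=TRUE still holds for the rest of the chunks, and you see that all of the chunks are evaluated. By contrast, chunk options are always evaluated dynamically.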
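To make that difference concrete, here is a toy model in plain R. It is emphatically not knitr's implementation: simulate_knit() and its arguments are made up for illustration, and the only two behaviours it encodes are the ones described above (hooks are skipped for cached chunks, options are evaluated for every chunk).

```r
# Toy model only; none of these names exist in knitr.
simulate_knit <- function(labels, cached = character(0),
                          mute_at = NULL, checkpoint = NULL) {
  global_eval <- TRUE                    # stands in for the global eval option
  mute_later <- function() { global_eval <<- FALSE; NULL }
  evaluated <- character(0)
  for (label in labels) {
    eval_this <- global_eval             # options in force for this chunk
    # Chunk options are evaluated for every chunk, cached or not
    if (identical(label, mute_at)) mute_later()
    if (!eval_this) next
    evaluated <- c(evaluated, label)     # run the chunk, or load its cache
    # Chunk hooks are skipped when the chunk comes from the cache
    if (!(label %in% cached) && identical(label, checkpoint)) {
      global_eval <- FALSE
    }
  }
  evaluated
}

labels <- c("example-a", "example-b", "example-c")

# Hook-based checkpoint: first run (nothing cached), then second run
first_run  <- simulate_knit(labels, checkpoint = "example-a")
second_run <- simulate_knit(labels, cached = "example-a",
                            checkpoint = "example-a")

# Option-based mute: still stops after example-a when it is cached
muted_run  <- simulate_knit(labels, cached = "example-a",
                            mute_at = "example-a")
```

In this model first_run and muted_run stop after example-a, while second_run goes through all three chunks, which matches the behaviour reported for the checkpoint gist.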

Answer 3

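Yihui points to a gist that comes close to doing exactly what I asked for.

In response to a question from Barry Rowlingson (aka Spacedman), Yihui constructed a 'checkpoint' hook, which lets the user set the name of the last chunk to be processed by knit. To process chunks up through the one named example-a, just do opts_chunk$set(checkpoint = 'example-a') somewhere in the initial 'setup' chunk.

The solution works beautifully the first time it is run with a given checkpoint. Unfortunately, on the second and later runs, knit seems to ignore the checkpoint and processes all of the chunks. (I discuss a workaround below, but it's not ideal.)

Here is a slightly abridged version of Yihui's gist: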

\documentclass{article}
\begin{document}

<<setup, include=FALSE>>=
rm(list = ls(all.names = TRUE), envir = globalenv())
opts_chunk$set(cache = TRUE) # enable cache to make it faster
opts_chunk$set(eval = TRUE, echo = TRUE, include = TRUE)

# Define hook that will skip all chunks after the one named in checkpoint
knit_hooks$set(checkpoint = function(before, options, envir) {
  if (!before && options$label == options$checkpoint) {
    opts_chunk$set(cache = FALSE, eval = FALSE, echo = FALSE, include = FALSE)
  }
})

## Set the checkpoint
opts_chunk$set(checkpoint = 'example-a') # restore objects up to example-a
@

<<example-a>>=
x = rnorm(4)
@

<<example-b>>=
y = rpois(10,5)
@

<<example-c>>=
z = 1:10
@

\end{document}

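Because checkpoint="example-a", the script above should run through the second chunk and then suppress all further chunks, including the ones that create y and z. Let's try it out a couple of times to see what happens: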

library(knitr)

## First time, works like a charm
knit("checkpoint.Rnw")
ls()
[1] "x"

## Second time, Oops!, runs right past the checkpoint
knit("checkpoint.Rnw")
ls()
[1] "x" "y" "z"

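The workaround I mentioned above is, after the first run, to:

1. Edit checkpoint.Rnw to set another checkpoint (by doing, e.g., opts_chunk$set(checkpoint = 'example-b')).
2. Run knit("checkpoint.Rnw").
3. Edit checkpoint.Rnw to set the checkpoint back to 'example-a' (by doing opts_chunk$set(checkpoint = 'example-a')).
4. Run knit("checkpoint.Rnw") again. This will again process all of the chunks up through, but not past, example-a.

This is a whole lot faster than recomputing the objects in all of the chunks, so it's good to know about even though it's not ideal.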
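The two edits in that workaround could be scripted rather than done by hand. The following is a rough sketch, not something from the gist: set_checkpoint() is a hypothetical helper, and it assumes the checkpoint is set on a single line of the form opts_chunk$set(checkpoint = '...').

```r
# Hypothetical helper: rewrite the checkpoint label in the lines of an Rnw file.
# Assumes a single-line call of the form opts_chunk$set(checkpoint = '<label>').
set_checkpoint <- function(lines, label) {
  pattern <- "(opts_chunk\\$set\\(checkpoint\\s*=\\s*)(['\"])[^'\"]*\\2"
  sub(pattern, paste0("\\1\\2", label, "\\2"), lines, perl = TRUE)
}

rnw <- c("## Set the checkpoint",
         "opts_chunk$set(checkpoint = 'example-a') # restore objects up to example-a")
switched <- set_checkpoint(rnw, "example-b")
switched[2]

# With knitr installed, the full round trip might look like this (untested sketch):
# rnw <- readLines("checkpoint.Rnw")
# writeLines(set_checkpoint(rnw, "example-b"), "checkpoint.Rnw")
# knitr::knit("checkpoint.Rnw")
# writeLines(set_checkpoint(rnw, "example-a"), "checkpoint.Rnw")
# knitr::knit("checkpoint.Rnw")
```

Only the quoted label is replaced; the rest of the line, including the trailing comment, is left as it was.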

Answer 4

How about adding the following chunk at the bottom of your markdown file?

```{r save_workspace_if_not_saved_yet, echo=FALSE}
if(!file.exists('knitr_session.RData')) {
  save.image(file = 'knitr_session.RData')
}
```

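The first time you knit, the state of the workspace at the end of the process will be saved (assuming that the process does not produce any errors). Whenever you need an up-to-date version of the workspace, just delete the file from the working directory and knit again.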
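To get at the saved objects from an interactive session, the file can be read back with load(). A minimal sketch, using a temporary file and a made-up object x as stand-ins for the real knitr_session.RData; loading into a fresh environment is my own choice here, so that nothing in the global workspace is overwritten.

```r
# Stand-in for the workspace file that the final chunk would write
ws_file <- file.path(tempdir(), "knitr_session.RData")
x <- c(3, 8)
save(x, file = ws_file)

# Load into a fresh environment instead of the global workspace
session <- new.env()
loaded <- load(ws_file, envir = session)

loaded                      # names of the restored objects
get("x", envir = session)   # the restored value
```

Note that this restores the state at the end of the document, not the state at the start of a particular chunk.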

Answer 5

They're just like any data file produced by save. If you grab the knitr-cache example from its new location, it's just:

> library(knitr)
> knit("./005-latex.Rtex")
> load("cache/latex-my-cache_d9835aca7e54429f59d22eeb251c8b29.RData")
> ls()
 [1] "x"
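Since the hash in the file name changes whenever the chunk changes, it can be handy to look the file up by chunk label. Here is a sketch under some assumptions: load_chunk_cache() is a hypothetical helper, it assumes file names of the form <prefix><label>_<hash>.RData as in the session above, and it assumes the chunk's objects really are stored in that .RData file. Where the objects live in .rdb/.rdx lazy-load databases, lazyLoad() would be needed instead.

```r
# Hypothetical helper: load a chunk's cached .RData file into a new environment.
load_chunk_cache <- function(label, cache_dir = "cache") {
  files <- list.files(cache_dir, pattern = "_[0-9a-f]+\\.RData$",
                      full.names = TRUE)
  # Drop the "_<hash>.RData" suffix, then keep stems that end with the label
  stems <- sub("_[0-9a-f]+\\.RData$", "", basename(files))
  hit <- files[endsWith(stems, label)]
  if (length(hit) != 1L) {
    stop("expected exactly one cache file for chunk '", label,
         "', found ", length(hit))
  }
  env <- new.env()
  load(hit, envir = env)
  env
}

# Demo with a fake cache directory (file name pattern copied from the session above)
cache_dir <- file.path(tempdir(), "cache-demo")
dir.create(cache_dir, showWarnings = FALSE)
x <- 1:3
save(x, file = file.path(cache_dir,
                         "latex-my-cache_d9835aca7e54429f59d22eeb251c8b29.RData"))

env <- load_chunk_cache("my-cache", cache_dir)
ls(env)
```

Matching on the end of the file stem is a simplification: a label that is a suffix of another label would make the lookup ambiguous, in which case the helper stops rather than guessing.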

How to use knitr cached results to reproduce the environment in a given chunk?