Question

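I have a large corpus that I am transforming with tm::tm_map(). Since I'm using hosted RStudio, I have 15 cores and would like to use parallel processing to speed things up.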

Without sharing a very large corpus, I am unable to reproduce the problem with dummy data.

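My code is below. A short description of the problem: looping over pieces manually in the console works, but doing the same thing inside my function does not.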

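The function "clean_corpus" takes a corpus as input, breaks it into pieces, and saves them to temporary files to help with RAM issues. The function then iterates over each piece using a %dopar% block. The function worked when testing on a small subset of the corpus, e.g. 10k documents. But on larger corpora the function returned NULL. To debug, I set the function to return the individual pieces that had been looped over rather than the rebuilt corpus as a whole. I found that on smaller corpus samples the code returns a list of all the mini corpora as expected, but when I tested on larger samples of the corpus, the function returns some NULLs.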

Here is why this is baffling to me:

cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works
cleaned.corpus <- clean_corpus(corpus.regular[10001:20000], n = 1000) # also works
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # NULL

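If I do this in 10k blocks, e.g. covering 50k documents in 5 separate calls, everything works. If I run the function on the full 50k documents in one call, it returns NULL.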

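So maybe I just need to loop over smaller pieces by breaking up my corpus more. I tried this. In the clean_corpus function below, the parameter n is the length of each piece. The function still returns NULL.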

So, if I iterate like this:

# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000)

If I do that 5 times manually, up to 50K, everything works. The equivalent of doing that in a single call to my function is:

# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000)

This returns NULL.

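This SO post, and the one linked from its only answer, suggest it might have to do with my hosted instance of RStudio running on Linux, where the Linux "out of memory killer" (OOM killer) might be stopping workers. This is why I tried breaking my corpus into pieces, to get around memory issues.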

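Any theories or suggestions as to why iterating over 10k documents in 10 chunks of 1k works, whereas 50k documents in 50 chunks of 1k does not?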

Here is the clean_corpus function:

clean_corpus <- function(corpus, n = 500000) { # n is length of each piece in parallel processing

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))
  lenp <- length(pieces)

  rm(corpus) # save memory

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(lenp)) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  rm(pieces) # save memory

  # doparallel
  registerDoParallel(cores = 14) # I've experimented with 2:14 cores
  pieces <- foreach(i = seq_len(lenp)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...) 
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T, char.keep = c("-", ":", "/"))))
  }

  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  return(corpus)

} # end clean_corpus function
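As a quick check of the chunking step, the grouping expression used in clean_corpus can be tried on a toy vector. This is only a sketch: the `docs` vector and the chunk length of 4 are made-up stand-ins for the real corpus and `n`.

```r
# Toy stand-in for the corpus: 10 "documents" and a chunk length of 4.
docs <- paste0("doc", 1:10)
n <- 4
nr <- length(docs)

# Same grouping expression as in clean_corpus(): 1 1 1 1 2 2 2 2 3 3
groups <- rep(1:ceiling(nr / n), each = n, length.out = nr)
pieces <- split(docs, groups)

# Sizes of the resulting pieces; the last piece holds the remainder.
piece_sizes <- lengths(pieces)
print(piece_sizes)
```

With n = 1000, 10k documents give 10 pieces and 50k documents give 50 pieces of the same size, so the individual pieces are no larger in the failing call. Only the number of results collected at the end grows.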

Here are the code blocks from above again, just for readability of flow after the function definition:

# iterate over 10k docs in 10 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:10000], n = 1000) # works

# iterate over 50K docs in 50 chunks of one thousand at a time
cleaned.corpus <- clean_corpus(corpus.regular[1:50000], n = 1000) # does not work

But iterating in the console by calling the function on each of

corpus.regular[1:10000], corpus.regular[10001:20000], corpus.regular[20001:30000], corpus.regular[30001:40000], corpus.regular[40001:50000] # does work on each run

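Note that I tried using the tm library's built-in support for parallel processing (see here), but I kept hitting "cannot allocate memory" errors, which is why I tried to do it "on my own" using doParallel and %dopar%.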

Answer 1

Summary of the solution from the comments

Your memory issue is likely related to corpus <- do.call(function(...) c(..., recursive = TRUE), pieces), because this still stores all of your (output) data in memory.

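I recommended exporting the output from each worker to a file, such as an RDS or csv file, rather than collecting it into a single data structure at the end.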

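Another issue (as you pointed out) is that foreach saves the output of each worker through an implied return statement (the code block in {} after %dopar% is treated as a function). I recommended adding an explicit return(1) before the closing } so that the intended output, which you have already explicitly saved to a file, is not also kept in memory.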
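A minimal sketch of that pattern follows, assuming the foreach and doParallel packages are installed. The toy `pieces` list, the `gsub()` call standing in for the tm_map() transformations, and the "_clean.rds" file suffix are all made up for illustration. The block ends with a bare 1, which plays the same role as the return(1) described above.

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for the corpus pieces: three small character vectors.
pieces <- list(c("a 1", "b 2"), c("c 3"), c("d 4", "e 5"))
n_pieces <- length(pieces)

# Write each input piece to its own RDS file, as in the question.
tmpfile <- tempfile()
for (i in seq_len(n_pieces)) {
  saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
}
rm(pieces) # the inputs now live on disk only

registerDoParallel(cores = 2)

# Each worker reads one piece, transforms it, writes the result to disk,
# and hands back only a small status value instead of the transformed piece.
status <- foreach(i = seq_len(n_pieces)) %dopar% {
  piece <- readRDS(paste0(tmpfile, i, ".rds"))
  piece <- gsub("[0-9]+", "", piece) # placeholder for the tm_map() calls
  saveRDS(piece, paste0(tmpfile, i, "_clean.rds"))
  1 # small value collected by foreach; the real output is in the file
}

stopImplicitCluster()

# Cleaned pieces can later be read back one file at a time.
first_clean <- readRDS(paste0(tmpfile, 1, "_clean.rds"))
```

Because `status` holds only one small number per piece, the memory used by the collected results no longer grows with the size of the corpus. Whether the cleaned files are ever recombined into one corpus is a separate decision that depends on the RAM available.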
