Question

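I have a raw text file, 70 GB in size, with more than 1 billion lines of varying length. There are no columns involved; it is just raw text.

I want to scan it and simply count how many times each word in a predefined set, search_words (about 100 words), appears. Currently I am using read_lines_chunked from the readr package to read chunks of 100,000 lines at a time and call a callback function f, which updates a global counter, like so: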

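library(tidyverse)

# Small test file standing in for the real 70 GB file
write_lines("cat and dog\r\ndog\r\nowl\r\nowl and cat", "test.txt")

search_words <- c("cat", "dog", "owl") # real size is about 100

# One running total per search word
counter <- numeric(length(search_words))

# Wrap a word in word boundaries so only whole words match
regex_word <- function(w) str_c("\\b", w, "\\b")

search_words <- map_chr(search_words, regex_word)

# Total matches of the i-th pattern across all lines of a chunk
count_word <- function(i, chunk) sum(str_count(chunk, search_words[i]))

# Chunk callback: add this chunk's counts to the global counter
f <- function(x, pos) {
  counter <<- counter + map_int(seq_along(search_words), count_word, x)
}

read_lines_chunked("test.txt", SideEffectChunkCallback$new(f), chunk_size = 100000)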

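This works well and finishes the job in one pass in under 24 hours on my 8-core, 16 GB RAM, Windows 10 laptop, which is not bad. But time is of the essence. Is there a solution that works on raw text rather than tabular CSV files (e.g. data.table's fread) and would do this faster on a single laptop? Preferably something as elegant as read_lines_chunked.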

Possible solutions I have thought of, but could not get to work with raw text or with chunking:

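The ff package
The bigmemory package
Simply calling a command-line tool through system() and doing the counting there. Do I have any reason to believe this would be faster?
Parallelization? I am not sure this is possible on Windows.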
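
Separately from the options listed above, one direction that might reduce the per-chunk work is to match all search words with a single alternation pattern and tabulate the hits, instead of scanning each chunk once per word. The following is only a minimal sketch under assumptions: pattern and count_chunk are hypothetical names, the words are assumed to contain no regex metacharacters, and whether this is actually faster on the real file is untested.

```r
library(stringr)

# Plain (unwrapped) search words; the real set has about 100 entries
search_words <- c("cat", "dog", "owl")

# Single pattern matching any of the words as a whole word (assumed helper name)
pattern <- str_c("\\b(?:", str_c(search_words, collapse = "|"), ")\\b")

# Hypothetical helper: counts for one chunk, in the order of search_words
count_chunk <- function(chunk) {
  # Extract every matched word from every line in one pass
  hits <- unlist(str_extract_all(chunk, pattern), use.names = FALSE)
  # Map each hit back to its index in search_words and count per index
  tabulate(match(hits, search_words), nbins = length(search_words))
}

# Same four lines as the test file in the question
chunk <- c("cat and dog", "dog", "owl", "owl and cat")
counts <- count_chunk(chunk)
names(counts) <- search_words
print(counts)
```

In the chunk callback this would take the place of the map_int call, for example counter <<- counter + count_chunk(x). Note that a single alternation finds non-overlapping matches only, so results could differ from the per-word approach if some search terms overlap each other.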

Fast (specific) word count in a 70 GB text file on a single laptop

0 answers: