Reading a large text file in chunks

Date: 2018-10-07 18:32:03

Tags: r

I'm working with limited RAM (a 1 GB AWS free-tier EC2 server).

I have a relatively large txt file, "vectors.txt" (800 MB), that I'm trying to read into R. Having tried various methods, I haven't been able to read the vectors into memory.

So I'm looking at ways of reading it in chunks. I know the dimensions of the resulting data frame should be 300K x 300. If I could read the file in, say, 10K lines at a time and save each chunk as an RDS file, I'd be able to loop over the results and get what I need, even if it's a little slower and less convenient than having the whole thing in memory.
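
Roughly what I have in mind (a minimal sketch; the chunk size and output file names below are just examples, not something I've run on the 1 GB machine):

# Sketch of the chunked read-and-save idea; chunk size and file names are illustrative
chunk_size <- 10000
n_rows     <- 300000                      # known dims: 300K x 300
n_chunks   <- ceiling(n_rows / chunk_size)

for (i in seq_len(n_chunks)) {
  chunk <- read.table("vector.txt",
                      skip  = 1 + (i - 1) * chunk_size,  # 1 header line + rows already read
                      nrows = chunk_size,
                      stringsAsFactors = FALSE)
  saveRDS(chunk, sprintf("word_vectors_chunk_%03d.rds", i))  # example file name
  rm(chunk)
  gc()                                                       # free memory before the next chunk
}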

To reproduce:

# Get data
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)

# word2vec r library
library(rword2vec)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")

So far so good. Here's where I'm struggling:

word_vectors = as.data.frame(read.table("vector.txt",skip = 1, nrows = 10))

This returns a "cannot allocate vector of size [size]" error message.

Alternative approaches tried:

word_vectors <- ff::read.table.ffdf(file = "vector.txt", header = TRUE)

Same problem: not enough memory.

word_vectors <- readr::read_tsv_chunked("vector.txt", 
                                        callback = function(x, i) saveRDS(x, i),
                                        chunk_size = 10000)

Result:

Parsed with column specification:
cols(
  `299567 300` = col_character()
)
|=========================================================================================| 100%  817 MB
Error in read_tokens_chunked_(data, callback, chunk_size, tokenizer, col_specs,  : 
  Evaluation error: bad 'file' argument.
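
(Looking at it again, I suspect the "bad 'file' argument" comes from my callback itself: saveRDS(x, i) passes the chunk position, an integer, as saveRDS()'s file argument. A sketch of a corrected callback, with file names of my own choosing; the column spec above also suggests the file isn't really tab-delimited, so each row comes back as a single character column:)

# Sketch: write each chunk to its own RDS file; the file-name pattern is just an example
word_vectors <- readr::read_tsv_chunked(
  "vector.txt",
  callback = function(x, pos) saveRDS(x, sprintf("chunk_starting_row_%d.rds", pos)),
  chunk_size = 10000
)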

Is there any other way to turn vectors.txt into a data frame? Maybe by breaking it into pieces, reading each piece in, saving it as a data frame and then as an RDS? Or any other alternatives?

EDIT: Based on Jonathan's answer below, tried:

library(rword2vec)
library(RSQLite)

# Download pre trained Google News word2vec model (Slimmed down version)
# https://github.com/eyaler/word2vec-slim
url <- 'https://github.com/eyaler/word2vec-slim/blob/master/GoogleNews-vectors-negative300-SLIM.bin.gz?raw=true'
file <- "GoogleNews-vectors-negative300-SLIM.bin.gz"
download.file(url, file) # takes a few minutes
R.utils::gunzip(file)
w2v_gnews <- "GoogleNews-vectors-negative300-SLIM.bin"
bin_to_txt(w2v_gnews,"vector.txt")


# from https://privefl.github.io/bigreadr/articles/csv2sqlite.html
csv2sqlite <- function(tsv,
                       every_nlines,
                       table_name,
                       dbname = sub("\\.txt$", ".sqlite", tsv),
                       ...) {

  # Prepare reading
  con <- RSQLite::dbConnect(RSQLite::SQLite(), dbname)
  init <- TRUE
  fill_sqlite <- function(df) {

    if (init) {
      RSQLite::dbCreateTable(con, table_name, df)
      init <<- FALSE
    }

    RSQLite::dbAppendTable(con, table_name, df)
    NULL
  }

  # Read and fill by parts
  bigreadr::big_fread1(tsv, every_nlines,
                       .transform = fill_sqlite,
                       .combine = unlist,
                       ... = ...)

  # Returns
  con
}

vectors_data <- csv2sqlite("vector.txt", every_nlines = 1e6, table_name = "vectors")

Result:

Splitting: 12.4 seconds.

 Error: nThread >= 1L is not TRUE

1 Answer:

Answer 0 (score: 1)

Another option would be to do the processing on disk, e.g. using an SQLite file and dplyr's database functionality. Here's one option: https://stackoverflow.com/a/38651229/4168169
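
As a rough sketch (assuming a "vectors" table in a "vector.sqlite" file, which is what the csv2sqlite() call in your edit would produce), you could then work with the table lazily via dplyr/dbplyr:

library(DBI)
library(dplyr)   # the database backend also needs dbplyr installed

con <- DBI::dbConnect(RSQLite::SQLite(), "vector.sqlite")
vectors_tbl <- dplyr::tbl(con, "vectors")

# Only the rows you ask for are pulled into RAM
first_rows <- vectors_tbl %>% head(10) %>% collect()

DBI::dbDisconnect(con)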

To get the CSV into SQLite, you could also use the bigreadr package, which has an article on how to do this: https://privefl.github.io/bigreadr/articles/csv2sqlite.html