Question

I have several different txt files with the same structure. Now I want to read them into R using fread, and then combine them into one bigger dataset.

## First put all file names into a list 
library(data.table)
all.files <- list.files(path = "C:/Users", pattern = "\\.txt$", full.names = TRUE)

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # Notice there's a "v" after setkey with multiple keys
    return(dt_temp)

}
# then using 
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

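The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it is fast. But with lapply the speed is extremely slow, and it obviously takes much longer than reading the files one by one. I would like to know what is going wrong here, and whether the speed can be improved.

I tried llply from the plyr package; there was not much of a speed gain.

Also, is there any syntax in data.table to achieve a vertical join, like rbind or union in SQL?

Thanks.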

Answer 1

Use rbindlist(), which is designed to rbind a list of data.tables together:

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )

And as @Roland says, do not set the key in each iteration of your function!

So in summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
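
The read-then-bind pattern recommended in this answer can be sketched end to end as follows. This is only an illustration under stated assumptions: the temporary directory, the file names and the `value` column are made up for the demo, and the files are tiny rather than 1M rows. The last step touches on the SQL part of the question: rbindlist() keeps duplicate rows, much like UNION ALL, and calling unique() on the result is one way to get UNION-like behaviour.

```r
library(data.table)

# Hypothetical demo data: three small comma-separated .txt files in a temp dir.
demo_dir <- file.path(tempdir(), "fread_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  fwrite(
    data.table(ID = c(2L, 1L),
               date = as.IDate("2020-01-01") + i,
               value = i * c(10, 20)),
    file.path(demo_dir, paste0("part", i, ".txt"))
  )
}

# full.names = TRUE so fread() receives complete paths.
all.files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read every file without keying, bind once, then set the key once.
l <- lapply(all.files, fread, sep = ",")
dt <- rbindlist(l)
setkey(dt, ID, date)

# Binding dt to itself duplicates every row (UNION ALL-like);
# unique() removes the duplicates again (UNION-like).
dt_union <- unique(rbindlist(list(dt, dt)))
```

Setting the key once on the combined table avoids sorting each piece separately, only to lose that ordering when the pieces are bound.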

Answer 2

I have re-written the code to do this far too many times. I finally rolled it into a handy function, shown below.

data.table_fread_mult <- function(filepaths = NULL, dir = NULL, recursive = FALSE, extension = NULL, ...){
  # fread() multiple filepaths and then combine the results into a single data.table
  # This function has two interfaces: either
  # 1) provide `filepaths` as a character vector of filepaths to read or 
  # 2) provide `dir` (and optionally `extension` and `recursive`) to identify the directory to read from
  # ... should be arguments to pass on to fread()
  
  # Scalar conditions, so use the short-circuiting && and || operators
  if(!is.null(filepaths) && (!is.null(dir) || !is.null(extension))){
    stop("If `filepaths` is given, `dir` and `extension` should be NULL")
  } else if(is.null(filepaths) && is.null(dir)){
    stop("If `filepaths` is not given, `dir` should be given")
  }
  
  # If filepaths isn't given, build it from dir, recursive, extension
  if(is.null(filepaths)){
    filepaths <- list.files(
      path = dir, 
      full.names = TRUE, 
      recursive = recursive, 
      pattern = paste0(extension, "$")
    )
  }
  
  # Read and combine files
  return(rbindlist(lapply(filepaths, fread, ...), use.names = TRUE))
}
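
A possible variation on the final read-and-combine step, shown as a standalone sketch rather than as part of the function above: if the list passed to rbindlist() is named, its `idcol` argument records which file each row came from, and `fill = TRUE` tolerates files whose columns do not match exactly. The directory, file names and the `source_file` column name are assumptions made up for this demo.

```r
library(data.table)

# Hypothetical demo files whose columns differ in order and number.
demo_dir <- file.path(tempdir(), "fread_idcol_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(ID = 1:2, value = c(0.5, 1.5)),
       file.path(demo_dir, "a.txt"))
fwrite(data.table(value = 2.5, ID = 3L, extra = "x"),
       file.path(demo_dir, "b.txt"))

filepaths <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Name each list element after its file so idcol can report the source.
tables <- lapply(filepaths, fread)
names(tables) <- basename(filepaths)

# use.names matches columns by name; fill adds NA where a column is missing.
combined <- rbindlist(tables, use.names = TRUE, fill = TRUE,
                      idcol = "source_file")
```

Here `combined` should have the columns `source_file`, `ID`, `value` and `extra`, with `extra` missing for the rows that came from `a.txt`.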

Fast reading and combining several files using data.table (with fread)
