I have several txt files that all share the same structure. I want to read them into R with fread and then combine them into one larger dataset.
## First put all file names into a list
library(data.table)
all.files <- list.files(path = "C:/Users",pattern = ".txt")
## Read data using fread
readdata <- function(fn){
  dt_temp <- fread(fn, sep=",")
  keycols <- c("ID", "date")
  setkeyv(dt_temp,keycols) # Notice there's a "v" after setkey with multiple keys
  return(dt_temp)
}
# then using
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)
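The code works, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.
Reading a single file with `fread` is fast. With `lapply`, however, it is extremely slow and clearly takes much longer than reading the files one by one. What is going wrong here, and is there any way to speed it up?
I also tried `llply` from the `plyr` package, with no real improvement.
Also, is there any syntax in `data.table` for a vertical join, like `rbind` or `union` in SQL?
Thanks.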
Answer 0 (score: 40)
Use `rbindlist()`, which is designed to `rbind` a `list` of `data.table`s together:
mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
And as @Roland says, do not set the key in each iteration of your function!
So, in summary, this is best:
l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
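The question also asks about an SQL-style union, which this answer does not address directly. As a minimal sketch, assuming two small made-up tables (`a` and `b` are illustrative, not the asker's data): `rbindlist()` behaves like `UNION ALL`, wrapping it in `unique()` gives `UNION`, and `funion()` does the same for exactly two tables.

```r
library(data.table)

# Two illustrative tables with identical columns; the row (2, "2020-01-02") is in both.
a <- data.table(ID = c(1L, 2L), date = c("2020-01-01", "2020-01-02"))
b <- data.table(ID = c(2L, 3L), date = c("2020-01-02", "2020-01-03"))

# UNION ALL: stack every row, duplicates included.
union_all <- rbindlist(list(a, b), use.names = TRUE)

# UNION: stack, then drop duplicate rows.
union_distinct <- unique(rbindlist(list(a, b), use.names = TRUE))

# Set-operation form for exactly two tables (all = FALSE is the default).
union_pair <- funion(a, b)
```

For more than two tables, the `unique(rbindlist(...))` form is probably the more convenient one, since `funion()` takes only a pair.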
Answer 1 (score: 1)
I have rewritten the code for this many times, and finally turned it into a handy function, shown below.
data.table_fread_mult <- function(filepaths = NULL, dir = NULL, recursive = FALSE, extension = NULL, ...){
  # fread() multiple filepaths and then combine the results into a single data.table
  # This function has two interfaces: either
  # 1) provide `filepaths` as a character vector of filepaths to read or
  # 2) provide `dir` (and optionally `extension` and `recursive`) to identify the directory to read from
  # ... should be arguments to pass on to fread()
  if(!is.null(filepaths) && (!is.null(dir) || !is.null(extension))){
    stop("If `filepaths` is given, `dir` and `extension` should be NULL")
  } else if(is.null(filepaths) && is.null(dir)){
    stop("If `filepaths` is not given, `dir` should be given")
  }
  # If filepaths isn't given, build it from dir, recursive, extension
  if(is.null(filepaths)){
    filepaths <- list.files(
      path = dir,
      full.names = TRUE,
      recursive = recursive,
      pattern = paste0(extension, "$")
    )
  }
  # Read and combine files
  return(rbindlist(lapply(filepaths, fread, ...), use.names = TRUE))
}
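When combining many files it can help to record which file each row came from. The following is a rough sketch of that variation, not part of the function above: it writes two throwaway files to a temporary directory (`demo_dir`, `a.txt`, `b.txt` and the column name `source_file` are all made up for illustration), names the list elements by file name, and lets the `idcol` argument of `rbindlist()` turn those names into a column. The key is set once, after combining.

```r
library(data.table)

# Write two small comma-separated .txt files into a throwaway directory (illustrative data).
demo_dir <- file.path(tempdir(), "fread_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(ID = 1:2, date = c("2020-01-01", "2020-01-02")), file.path(demo_dir, "a.txt"))
fwrite(data.table(ID = 3:4, date = c("2020-01-03", "2020-01-04")), file.path(demo_dir, "b.txt"))

# full.names = TRUE returns paths fread() can open from any working directory;
# the escaped, anchored pattern matches only names ending in ".txt".
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file, then name the list elements so idcol can use the names.
tables <- lapply(files, fread, sep = ",")
names(tables) <- basename(files)

# Combine once, tagging each row with its source file, then set the key once.
combined <- rbindlist(tables, use.names = TRUE, idcol = "source_file")
setkey(combined, ID, date)
```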