Question

A similar question may be "R: Applying readRDS to a list object of .Rds file names"; however, the solution there is not much more efficient than a for loop.

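In a folder, I have stored 500 .rds files named file_1.rds, file_2.rds, ..., file_500.rds. Each file contains about 200 records and 6 variables; they are small chunks of a large data.frame.

mydf <- data.frame()
for (m in 1:500) {
        temp <- readRDS(paste0("H://myfolder//file_", m, ".rds"))
        mydf <- rbind(mydf, temp)
}

Do you have any suggestions for a more efficient approach or for how to improve the code? Also, since I create these 500 .rds files myself, I am open to improving the write process, for example by saving in any other format that is more efficient to read than .rds.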
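
One common way to avoid the repeated rbind inside the loop, which copies the growing data frame on every iteration, is to read all files into a list and bind them once at the end. The sketch below is a minimal base-R illustration under stated assumptions: it writes small demo files to a temporary directory instead of H://myfolder, the demo columns are made up, and the helper name read_all_rds is hypothetical. Writing with compress = FALSE is an optional choice that may speed up reading at the cost of disk space; it is worth benchmarking on the real data.

```r
# Demo setup (assumption): create 5 small files in a temp folder,
# standing in for file_1.rds ... file_500.rds
demo_dir <- file.path(tempdir(), "rds_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (m in 1:5) {
  piece <- data.frame(id = ((m - 1) * 200 + 1):(m * 200), value = m)
  # compress = FALSE trades disk space for (possibly) faster reads
  saveRDS(piece, file.path(demo_dir, paste0("file_", m, ".rds")),
          compress = FALSE)
}

# Hypothetical helper: read every file into a list, then bind once
read_all_rds <- function(files) {
  pieces <- lapply(files, readRDS)
  do.call(rbind, pieces)
}

files <- file.path(demo_dir, paste0("file_", 1:5, ".rds"))
mydf <- read_all_rds(files)
```

If data.table or dplyr is installed, data.table::rbindlist(pieces) or dplyr::bind_rows(pieces) can replace do.call(rbind, pieces) and are usually faster on many pieces.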

Answer 1

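I found a solution using purrr. The problem is that with thousands of .rds files to read, I need a loop that reads them in small chunks (note that purrr::map_df itself processes the files sequentially, not in parallel).
Otherwise, I get a memory error and all progress is lost.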

library(purrr)

# Create a character vector with the names of all .rds files to read
rds <- paste0("H://folder//myrds", 1:3870, ".rds")

# Read the files in chunks of 200 each to keep memory use bounded
chunk_size <- 200
n <- ceiling(length(rds) / chunk_size)

# Start from the previously saved result (this file must already exist)
mydf <- readRDS("H://folder//mydf.rds")

for (m in seq_len(n)) {
        # Indices of the files in chunk m; the last chunk may be shorter
        idx <- (chunk_size * (m - 1) + 1):min(chunk_size * m, length(rds))
        temp <- purrr::map_df(rds[idx], readRDS)
        mydf <- rbind(mydf, temp)
        rm(temp)
        gc()
        print(m)  # progress indicator
}

saveRDS(mydf, "H://folder//mydf.rds")
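
If the reads should actually run in parallel, one option is the parallel package that ships with R. The following is a minimal sketch under stated assumptions, not the method used in this answer: the demo files, the cluster size of 2, and the chunk size of 4 are illustrative only. Whether this beats a sequential read depends on disk speed and file size, so it should be timed on the real data.

```r
library(parallel)

# Demo setup (assumption): 6 small files standing in for myrds1.rds, ...
demo_dir <- file.path(tempdir(), "rds_parallel_demo")
dir.create(demo_dir, showWarnings = FALSE)
files <- file.path(demo_dir, paste0("myrds", 1:6, ".rds"))
for (i in seq_along(files)) {
  saveRDS(data.frame(file_id = i, x = 1:10), files[i])
}

# Split the file names into chunks (size 4 here; 200 in the code above)
chunk_size <- 4
chunks <- split(files, ceiling(seq_along(files) / chunk_size))

# Start 2 worker processes; the default socket cluster also works on Windows
cl <- makeCluster(2)

chunk_results <- vector("list", length(chunks))
for (k in seq_along(chunks)) {
  # The workers read the files of chunk k concurrently
  pieces <- parLapply(cl, chunks[[k]], readRDS)
  chunk_results[[k]] <- do.call(rbind, pieces)
}
stopCluster(cl)

# Bind the per-chunk data frames once at the end
mydf_par <- do.call(rbind, chunk_results)
```

Collecting the per-chunk results in a list and binding once avoids growing mydf with rbind on every iteration; saving each chunk result to disk inside the loop would additionally preserve progress if a later chunk fails.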

Reading 500 separate small .rds files in parallel into a single data frame
