Question

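I have a large JSON file, over 2 GB in size. Because there is so much data, I cannot build a data frame from the whole dataset at once. I want to parse out specific information and write it to a CSV file.

So I am looking for a technique to create data frames with a specific number of rows.

Say parsing the JSON into a data frame would give me 2M rows. I would like to create a data frame with only 10k-15k rows per pass, then write some of the information to the CSV file.

Each pass would handle 10k-15k rows, until all 2M rows are done.

I am using the tidyjson and dplyr packages.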

Answer 1

I would suggest splitting the large file into smaller files and processing them in parallel:

 library(parallel)

 # Locate the smaller JSON files produced by splitting the large one.
 json_files <- list.files(path = "path/to/jsons", pattern = "\\.json$", full.names = TRUE)

 # Leave one core free for the rest of the system.
 no_cores <- max(1, detectCores() - 1)
 cl <- makeCluster(no_cores)

 # Parse the files in parallel; parLapply only needs the cluster object.
 system.time(json_list <- parLapply(cl, json_files, function(x) rjson::fromJSON(file = x, method = "R")))

 # Stop the cluster so that resources such as memory are returned to the operating system.
 stopCluster(cl)
 gc()  # Just a garbage collection call.

You now have a list, with one element per file, that contains all of the imported data.
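To get from that list to a CSV in passes of 10k-15k rows, something along these lines might work. It is only a sketch in base R: `write_in_batches` and its `extract` argument are hypothetical names, and it assumes `records` is a flat list with one parsed record per element (the per-file list above may need flattening first) and that `extract` returns a one-row data frame.

```r
# Hypothetical helper: turn records into small data frames, one batch at a
# time, and append each batch to a single CSV file.
write_in_batches <- function(records, csv_path, extract, batch_size = 10000) {
  # Group record indices into consecutive batches of at most batch_size.
  idx <- seq_along(records)
  batches <- split(idx, ceiling(idx / batch_size))
  for (i in seq_along(batches)) {
    # Build a data frame from this batch only, keeping just the wanted fields.
    rows <- lapply(records[batches[[i]]], extract)
    chunk <- do.call(rbind, rows)
    first <- i == 1
    # Write the header with the first batch, then append the later ones.
    write.table(chunk, csv_path, sep = ",", row.names = FALSE,
                col.names = first, append = !first, qmethod = "double")
  }
  invisible(length(batches))
}

# Usage with made-up records; real ones would come from the parsed JSON.
records <- lapply(1:25, function(i) list(id = i, name = paste0("item", i), extra = "ignored"))
out <- tempfile(fileext = ".csv")
n_batches <- write_in_batches(
  records, out,
  extract = function(rec) data.frame(id = rec$id, name = rec$name),
  batch_size = 10
)
```

Only one batch is held as a data frame at any time, so `batch_size` can be set to whatever fits comfortably in memory (10000 to 15000 in the question's terms).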
