Speed up loading and combining multiple .csv files into one big matrix in R

Date: 2016-11-24 10:29:55

Tags: r csv

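I have been following a couple of posts: How to combine multiple .csv files in R? and Reading Many CSV Files at the Same Time in R and Combining All into one dataframe

My goal is essentially the same: combine multiple, very large CSV files into one big matrix in R. I have the solution below, and I would like to speed it up as much as possible:

Here is a fully reproducible example; my real files are larger and far more numerous.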

 setwd("C:/") #### set an easy directory to create acceptably large files
 #### this takes about 60 seconds
 for(i in 1:80){
   print(80-i)
   write.table(matrix(rnorm(20*3891,0,1),ncol=20),col.names=FALSE,row.names=FALSE,sep=",",file=paste(i,"file.csv",sep=""))
 }
 #### pattern is a regular expression, not a shell glob
 listfiles<-list.files(path="C:/",pattern="\\.csv$")
 #### now the problem: this takes about 30-40 seconds; as I have bigger (and much more) files I want to speed up this step
 library(plyr)
 mybigmatrix<-ldply(listfiles,read.csv,header=FALSE)

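Thanks in advance for any help.

3 answers:

Answer 0 (score: 0)

You could try a dedicated package such as readr and its function read_csv(). Note that read_csv() takes col_names rather than header: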

mybigmatrix<-ldply(listfiles,readr::read_csv,col_names=FALSE)
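
As a rough sketch of how the readr route could look end to end, the snippet below reads each file as a numeric matrix and binds the pieces once at the end. It assumes readr is installed; the temporary demo directory and the helper name read_one are made up for illustration, and I have not timed it against the real files.

```r
library(readr)

# Hypothetical demo files in a temporary directory (stand-ins for the real CSVs)
demo_dir <- file.path(tempdir(), "readr_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.table(matrix(rnorm(20 * 10), ncol = 20),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}

listfiles <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# read_one is an illustrative helper: headerless file, every column read as double
read_one <- function(f) {
  as.matrix(read_csv(f, col_names = FALSE,
                     col_types = cols(.default = col_double())))
}

# Bind all pieces in a single call instead of growing the result step by step
mybigmatrix <- do.call(rbind, lapply(listfiles, read_one))
dim(mybigmatrix)
```

Declaring the column types up front skips readr's type guessing, which may help on wide files, but that is worth measuring with system.time() on your own data.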

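Answer 1 (score: 0)

Here is a fully reproducible example showing a problem with fread(): it does not let me coerce the resulting data.table object into a (numeric) matrix.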

 setwd("C:/") #### set an easy directory to create acceptably large files
 #### this takes a few seconds
 for(i in 1:5){
   print(5-i)
   write.table(matrix(rnorm(5*3891,0,1),nrow=5),col.names=FALSE,row.names=FALSE,sep=",",file=paste(i,"file.csv",sep=""))
 }
 #### pattern is a regular expression, not a shell glob
 listfiles<-list.files(path="C:/",pattern="\\.csv$")


 library(data.table) #### provides fread() and rbindlist()
 myfread<-function(file){
   data_frame <- fread(file,sep=",",header=FALSE,stringsAsFactors=FALSE,select=c(1:3891),colClasses=c(rep("as.numeric",3891)))
   data_frame
 }

 ###### this is a matrix 25*3891; I want an array of 1297x3x25
 alld<-rbindlist(lapply(listfiles,myfread))
 ### why is this in characters??
 as.matrix(alld)
 k<-1297
 m<-3
 vectorr<-as.vector(t(as.matrix(alld)))
 tem <- vectorr
 n <- length(tem)/(k * m)
 tem <- array(tem, c(m, k, n))
 tem <- aperm(tem, c(2, 1, 3))
 xup <- tem ####### here I have characters
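
A possible cause of the character values, offered as a guess rather than a verified diagnosis: colClasses expects type names such as "numeric", whereas "as.numeric" is the name of a conversion function. The sketch below uses "numeric" and then does the same reshape on a scaled-down case (k = 4 and m = 3 instead of 1297 and 3). The demo directory and file sizes are made up for illustration, and it assumes data.table is installed.

```r
library(data.table)

k <- 4  # rows per slice (1297 in the post above)
m <- 3  # columns per slice

# Hypothetical demo files: 2 files, 5 rows each, k * m values per row
demo_dir <- file.path(tempdir(), "fread_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  write.table(matrix(rnorm(5 * k * m), nrow = 5),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}
listfiles <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# "numeric" is a type name; one entry per column
myfread <- function(file) {
  fread(file, sep = ",", header = FALSE, colClasses = rep("numeric", k * m))
}

alld <- rbindlist(lapply(listfiles, myfread))
mat <- as.matrix(alld)  # numeric matrix, 10 x 12
n <- nrow(mat)

# t(mat) lays the values out row by row; each input row fills one m x k slice,
# and aperm() swaps the first two dimensions to give k x m x n
xup <- aperm(array(t(mat), c(m, k, n)), c(2, 1, 3))
dim(xup)
```

With this layout, xup[1, , 1] holds the first m values of the first input row, xup[2, , 1] the next m, and so on, which is the same ordering the original reshape produces.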

Answer 2 (score: 0)

I think any of these options will work for you.

# Option 1: read every CSV in the working directory and bind the rows
setwd("C:/Users/your_path_here/test")
fnames <- list.files(pattern = "\\.csv$")
csv <- lapply(fnames, read.csv)
result <- do.call(rbind, csv)

# Option 2: same idea without changing the working directory.
# setwd() returns the PREVIOUS directory, so list the target path directly.
filedir <- "C:/Users/your_path_here/csv_files"
file_names <- list.files(filedir, pattern = "\\.csv$", full.names = TRUE)
your_data_frame <- do.call(rbind, lapply(file_names, read.csv))

# Option 3: skip the first line of each file and read the rest as headerless
your_data_frame <- do.call(rbind, lapply(file_names, read.csv, skip = 1, header = FALSE))

# Option 4: files with no header line at all
your_data_frame <- do.call(rbind, lapply(file_names, read.csv, header = FALSE))

# Option 5: keep the files as a list of data frames instead of binding them
setwd("C:/Users/Excel/Desktop/test")
temp <- list.files(pattern = "\\.csv$")
myfiles <- lapply(temp, read.csv)

Finally, try this:

setwd("C:/Users/your_path_here/")

file_list <- list.files()

# header=TRUE and sep="\t" assume tab-delimited files with a header row;
# for headerless comma-separated files use header=FALSE, sep=","
for (file in file_list){
  if (!exists("dataset")){
    # if the merged dataset doesn't exist, create it
    dataset <- read.table(file, header=TRUE, sep="\t")
  } else {
    # otherwise append to it (the else keeps the first file from being read twice)
    temp_dataset <- read.table(file, header=TRUE, sep="\t")
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
}
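
One caveat on the loop: calling rbind() on every pass copies everything accumulated so far, so it tends to slow down as the number of files grows. For purely numeric, headerless files like the ones in the question, a base-R sketch is to read each file with scan() straight into a matrix and bind everything once at the end. The demo directory, the 20-column layout, and the helper name read_matrix are assumptions for illustration; whether this is faster on your files needs measuring.

```r
# Hypothetical demo files in a temporary directory (stand-ins for the real CSVs)
demo_dir <- file.path(tempdir(), "scan_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.table(matrix(rnorm(20 * 10), ncol = 20),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}
listfiles <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# read_matrix is an illustrative helper: scan() returns one long numeric
# vector, which is folded back into rows with byrow = TRUE
read_matrix <- function(f, ncol) {
  matrix(scan(f, what = numeric(), sep = ",", quiet = TRUE),
         ncol = ncol, byrow = TRUE)
}

# Read everything into a list first, then bind in a single call
mats <- lapply(listfiles, read_matrix, ncol = 20)
big <- do.call(rbind, mats)
dim(big)
```

Wrapping the read step in system.time() for both this and the read.csv version gives a quick comparison on your own data.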