Question

I have a set of data (about 50,000 files, each about 1.5 MB). To load the data and process it, I first used this code:

data <- list() # creates a list
listcsv <- dir(pattern = "*.txt") # creates the list of all the csv files in the directory

Then I use a for loop to load each file:

for (k in 1:length(listcsv)){
data[[k]]<- read.csv(listcsv[k],sep = "",as.is = TRUE, comment.char = "", skip=37);
my<- as.matrix(as.double(data[[k]][1:57600,2]));
mesh <- array(my, dim = c(120,60,8));
Ms<-1350*10^3    # A/m
asd2=(mesh[70:75,24:36 ,2])/Ms;     # in A/m
ort_my<- mean(asd2);

print(ort_my);

a[k]<-ort_my;

write(a,file="D:/ddd/ads.txt",sep='\t',ncolumns=1)}

So I started the program, but it still had not finished after 6 hours, even though I have a decent computer with 32 GB of RAM and a 6-core CPU.

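I searched the forum, and people say the fread function might be useful. However, all the examples I have found so far deal with reading a single file with fread.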

Can anyone suggest a way to read the files in a loop faster and to process data with this many rows and columns?

Answer 1

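I suspect there is a more efficient way to extract what you want. But I think running in parallel can save you a lot of time. Not storing every file will also save memory.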

library("data.table")

#Create function you want to eventually loop through in parallel
readFiles <- function(x) {
   data <- fread(x,skip=37)
   my <- as.matrix(data[1:57600,2,with=F]);
   mesh <- array(my, dim = c(120,60,8));
   Ms<-1350*10^3    # A/m
   asd2=(mesh[70:75,24:36 ,2])/Ms;     # in A/m

   ort_my<- mean(asd2);
   return(ort_my)
}


#R Code to run functions in parallel

library("foreach"); library("parallel"); library("doMC")
detectCores() #This will tell you how many cores are available
registerDoMC(8) #Register the parallel backend

#Can change .combine from rbind to list
OutputList <- foreach(x = listcsv, .combine = rbind, .packages = c("data.table")) %dopar% (readFiles(x))

registerDoSEQ() #Very important to close out parallel backend.
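
One caveat: doMC relies on forking, so it is not available on Windows, and the D:/ path in the question suggests a Windows machine. As a rough alternative, the same idea can be sketched with a PSOCK cluster from the parallel package, which ships with R. The names mean_region and mean_region_parallel are hypothetical helpers made up for this sketch. The skip count, array shape, index ranges and Ms are copied from the question. It uses read.table so that it runs without extra packages; swapping in fread should be faster, but that is left as an assumption here.

```r
library(parallel)  # ships with base R; PSOCK clusters also work on Windows

# Hypothetical helper: reduce one file to a single number without keeping
# the file contents around. Defaults mirror the values in the question.
mean_region <- function(path, skip = 37, dims = c(120, 60, 8),
                        rows = 70:75, cols = 24:36, layer = 2, Ms = 1350e3) {
  tab <- utils::read.table(path, skip = skip, header = TRUE, as.is = TRUE,
                           nrows = prod(dims), comment.char = "")
  # Second column reshaped into the 3-D mesh, as in the question
  mesh <- array(as.double(tab[[2]]), dim = dims)
  mean(mesh[rows, cols, layer] / Ms)
}

# Hypothetical wrapper: apply mean_region() to many files on a PSOCK cluster.
# Extra arguments in ... are forwarded to mean_region().
mean_region_parallel <- function(paths, n_workers = 2, ...) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl), add = TRUE)  # always release the workers
  unlist(parLapply(cl, paths, mean_region, ...))
}

# Possible usage (assumes the .txt files are in the working directory):
# listcsv <- dir(pattern = "\\.txt$")
# a <- mean_region_parallel(listcsv, n_workers = 6)
# write(a, file = "D:/ddd/ads.txt", sep = "\t", ncolumns = 1)
```

Each worker returns only one number per file, and the result is written once at the end rather than inside the loop, which avoids rewriting the output file 50,000 times.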
