I am computing the correlation between two datasets, but I am running into memory problems because the data is large (10 GB) while my machine only has 6 GB of RAM. How can I process the data in chunks?
dir1 <- list.files("D:sdr", "*.bin", full.names = TRUE)
dir2 <- list.files("D:dsa", "*.img", full.names = TRUE)
file_tot <- array(dim = c(1440, 720, 664, 2))
for (i in 1:length(dir1)) {
  file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * 720, signed = T)
  file_tot[, , i, 2] <- readBin(dir2[i], integer(), size = 2, n = 1440 * 720, signed = F)
  file_tot[, , i, 2] <- file_tot[, , i, 2] * 0.000030518594759971
  file_tot[, , i, 2][file_tot[, , i, 2] == 9999] <- NA
}
result <- apply(file_tot, c(1, 2), function(x) { cor(x[, 1], x[, 2]) })
But I got this error:
Error: cannot allocate vector of size 10.3 Gb
In addition: Warning messages:
1: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
2: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
3: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
4: In file_tot[, , i, 1] <- readBin(dir1[i], numeric(), size = 4, n = 1440 * :
Reached total allocation of 16367Mb: see help(memory.size)
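For reference, the requested allocation matches the size of the full array: 1440 × 720 × 664 × 2 doubles at 8 bytes each is roughly 10.3 GiB, which is exactly the size the error reports. A quick back-of-the-envelope check (plain arithmetic, no data files needed):

```r
# Size of the full 4-D double array in GiB (R stores numerics as 8-byte doubles)
cells <- 1440 * 720 * 664 * 2
gib <- cells * 8 / 2^30
round(gib, 1)  # about 10.3, matching "cannot allocate vector of size 10.3 Gb"
```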
Answer 0 (score: 1)
This is a very common problem when working with large data. Fortunately, there are several solutions. Links you may find useful:
difference between ff and filehash package in R
In R which packages for loading larger data quickly
Example of bigmemory and friends with file backing
Work in R with very large data set
Also, I'd normally suggest you search for these yourself, but I did it for you. Hopefully a bit of reading will solve this problem! Good luck!
Answer 1 (score: 1)
If all you need is this correlation, you don't have to switch to a package such as ff or bigmemory; you can simply process the files in chunks. Those big-data packages become useful when you plan to do more analyses on the data. Below is an example of how to process the files chunkwise:
# Generate some data; in this case I only use 7 columns,
# but it should scale to any number of columns (except
# perhaps generating the files)
dim <- c(1440, 7, 664, 2)
# The last line should be replaced by the next for the data in
# the question
# dim <- c(1440, 720, 664, 2)
for (i in seq_len(dim[3])) {
  dat <- rnorm(dim[1] * dim[2])
  writeBin(dat, paste0("file", i, ".bin"), size = 4)
  dat <- rnorm(dim[1] * dim[2])
  writeBin(dat, paste0("file", i, ".img"), size = 4)
}
dir1 <- list.files("./", "*.bin", full.names = TRUE)
dir2 <- list.files("./", "*.img", full.names = TRUE)
result <- array(dim = c(dim[1], dim[2]))
file_tot <- array(dim = c(dim[1], dim[3], dim[4]))
# Process the files column by column
for (j in seq_len(dim[2])) {
  for (i in 1:length(dir1)) {
    # Open first file
    con <- file(dir1[i], 'rb')
    # Skip to the current column
    seek(con, (j - 1) * dim[1] * 4)
    # Read column
    file_tot[, i, 1] <- readBin(con, numeric(), size = 4, n = dim[1])
    close(con)
    # And repeat for the second file
    con <- file(dir2[i], 'rb')
    seek(con, (j - 1) * dim[1] * 4)
    file_tot[, i, 2] <- readBin(con, numeric(), size = 4, n = dim[1])
    # For the datasets in the question the previous two lines should be
    # replaced by the next four (note the element size of 2, which also
    # changes the seek offset):
    #seek(con, (j - 1) * dim[1] * 2)
    #file_tot[, i, 2] <- readBin(con, integer(), size = 2, n = dim[1], signed = F)
    #file_tot[, i, 2] <- file_tot[, i, 2] * 0.000030518594759971
    #file_tot[, i, 2][file_tot[, i, 2] == 9999] <- NA
    close(con)
  }
  result[, j] <- apply(file_tot, c(1), function(x) { cor(x[, 1], x[, 2]) })
}
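As a sanity check, the column-by-column approach gives the same answer as computing the correlation over the full array at once. This is a small in-memory sketch (random stand-in arrays rather than the file-backed data above) that mirrors the chunked loop:

```r
set.seed(1)
# Small stand-in data: rows x columns x files, for the two variables
nr <- 10; nc <- 3; nf <- 5
a <- array(rnorm(nr * nc * nf), dim = c(nr, nc, nf))
b <- array(rnorm(nr * nc * nf), dim = c(nr, nc, nf))

# Full computation: correlation across files for every (row, column) cell
full <- array(dim = c(nr, nc))
for (r in seq_len(nr)) for (cl in seq_len(nc))
  full[r, cl] <- cor(a[r, cl, ], b[r, cl, ])

# Chunked computation: one column at a time, as in the loop above
chunked <- array(dim = c(nr, nc))
for (j in seq_len(nc)) {
  slab <- array(dim = c(nr, nf, 2))
  slab[, , 1] <- a[, j, ]
  slab[, , 2] <- b[, j, ]
  chunked[, j] <- apply(slab, 1, function(x) cor(x[, 1], x[, 2]))
}
stopifnot(all.equal(full, chunked))
```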