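I have a large text file (> 10 million lines, > 1 GB) that I want to process one line at a time, to avoid loading the whole thing into memory. After processing each line, I want to save some variables into a big.matrix object. Here is a simplified example: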
library(bigmemory)
library(pryr)
con <- file('x.csv', open = "r")
x <- big.matrix(nrow = 5, ncol = 1, type = 'integer')
for (i in 1:5){
print(c(address(x), refs(x)))
y <- readLines(con, n = 1, warn = FALSE)
x[i] <- 2L*as.integer(y)
}
close(con)
x.csv contains:
4
18
2
14
16
Following the advice at http://adv-r.had.co.nz/memory.html, I printed the memory address of the big.matrix object, and it appears to change with each loop iteration:
[1] "0x101e854d8" "2"
[1] "0x101d8f750" "2"
[1] "0x102380d80" "2"
[1] "0x105a8ff20" "2"
[1] "0x105ae0d88" "2"
Can big.matrix objects be modified in place?
Is there a better way to load, process, and then save this data? The current method is slow!
Answer 0 (score: 2)
- Is there a better way to load, process, and then save this data? The current method is slow!
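The slowest part of your approach is the call that reads each line individually. We can instead "chunk" the data, reading several lines at a time, so that we stay under the memory limit while likely speeding things up.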
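As a minimal sketch of the chunking idea, assuming a one-column file of integers like the x.csv in the question: the loop below pulls `chunk_size` lines per `readLines` call and stores the doubled values in a preallocated vector. The names `read_doubled`, `chunk_size`, and `n_total` are hypothetical, and a plain integer vector stands in for the big.matrix.

```r
# Sketch: read a one-column integer file in chunks instead of line by line.
# read_doubled, chunk_size, and n_total are illustrative names, not from the question.
read_doubled <- function(path, n_total, chunk_size = 2L) {
  out <- integer(n_total)            # preallocated result (stand-in for a big.matrix)
  con <- file(path, open = "r")
  on.exit(close(con))
  pos <- 0L                          # number of values stored so far
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break   # end of file reached
    idx <- pos + seq_along(lines)    # positions this chunk fills
    out[idx] <- 2L * as.integer(lines)
    pos <- pos + length(lines)
  }
  out
}

# Usage with the five values from the question
tmp <- tempfile(fileext = ".csv")
writeLines(c("4", "18", "2", "14", "16"), tmp)
result <- read_doubled(tmp, n_total = 5L, chunk_size = 2L)
print(result)
```

A larger `chunk_size` means fewer calls to `readLines`, at the cost of holding more lines in memory at once.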
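Here's the plan: read the file one chunk at a time, perform the operation on each chunk, and push that chunk back into a new file to save for later.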
library(readr)
# Make a test file: 100,000 rows by 10 columns of random normal values
x <- data.frame(matrix(rnorm(1000000),100000,10))
write_csv(x,"./test_set2.csv")
# Create a function to read a variable in file and double it
calcDouble <- function(calc.file,outputFile = "./outPut_File.csv",
read.size=500000,variable="X1"){
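# Set up variables
num.lines <- 0
lines.per <- NULL
var.top <- NULL
i <- 1L  # R vectors are 1-indexed; starting at 0 would silently drop the first chunk count
# Gather column names and position of objective column
connection.names <- file(calc.file,open="r")
data.names <- read.table(connection.names,sep=",",header=TRUE,nrows=1)
close(connection.names)
col.name <- which(colnames(data.names)==variable)
# Find length of file by line, recording how many lines each chunk holds
connection.len <- file(calc.file,open="r")
while((linesread <- length(readLines(connection.len,read.size)))>0){
lines.per[i] <- linesread
num.lines <- num.lines + linesread
i <- i+1L
}
close(connection.len)
# The first chunk's count includes the header line, which is skipped when reading
lines.per[1] <- lines.per[1] - 1L
lines.per <- lines.per[lines.per > 0]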
# Open a fresh read-only connection for the doubling pass
# Loop through the file and double the chosen variable
connection.double <- file(calc.file,open="r")
for (j in 1:length(lines.per)){
# On the first chunk only, skip=1 skips the header line
# Read in a chunk of the file
if (j == 1) {
data <- read.table(connection.double,sep=",",header=FALSE,skip=1,nrows=lines.per[j],comment.char="")
} else {
data <- read.table(connection.double,sep=",",header=FALSE,nrows=lines.per[j],comment.char="")
}
# Grab the columns we need and double them
double <- data[,I(col.name)] * 2
if (j != 1) {
write_csv(data.frame(double),outputFile,append = TRUE)
} else {
write_csv(data.frame(double),outputFile)
}
message(paste0("Reading from Chunk: ",j, " of ",length(lines.per)))
}
close(connection.double)
}
calcDouble("./test_set2.csv",read.size = 50000, variable = "X1")
This gives us back a .csv file with the manipulated data. You can change double <- data[,I(col.name)] * 2 to whatever operation you need to perform on each chunk.
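One way to make that swap easier, sketched here in base R only, is to pass the per-chunk operation in as a function argument. `process_column`, `chunk_fun`, and the output column name `value` are hypothetical names introduced for illustration; this is an alternative sketch under those assumptions, not a drop-in replacement for calcDouble.

```r
# Sketch: apply an arbitrary function to one column, chunk by chunk.
# process_column and chunk_fun are hypothetical names, not part of the code above.
process_column <- function(in_file, out_file, variable, chunk_fun,
                           chunk_size = 50000L) {
  con <- file(in_file, open = "r")
  on.exit(close(con))
  # The first line holds the column names (strip any quoting)
  header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)
  col <- which(header == variable)
  first <- TRUE
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break   # end of file reached
    chunk <- read.table(text = lines, sep = ",", header = FALSE,
                        comment.char = "")
    result <- data.frame(value = chunk_fun(chunk[[col]]))
    # Write the header with the first chunk, then append without it
    write.table(result, out_file, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
    first <- FALSE
  }
  invisible(out_file)
}

# Usage: double column X1 of a small test file, two rows per chunk
in_file <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(X1 = 1:5, X2 = 6:10), in_file, row.names = FALSE)
process_column(in_file, out_file, "X1", function(v) v * 2, chunk_size = 2L)
doubled <- read.csv(out_file)
print(doubled$value)
```

Because the connection stays open across iterations, each `readLines` call continues where the previous one stopped, so no separate pass to count lines is needed in this variant.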