Partitioning a large file into smaller files in R

Asked: 2018-05-13 05:48:36

Tags: r loops bigdata chunks

I need to split a large file (14 gigabytes) into smaller files. The file is in txt format with ";" as the separator, and I know it has 70 columns (string, double). I would like to read it 1 million rows at a time and save each batch to a separate file: file1, file2 ... fileN.

With @MKR's help I put together the code below,

but the process is very slow. I tried to use fread, but that was not possible.

How can I optimize this code?

New code

chunkSize <- 500000
conex <- file(description = db, open = "r")

# Read the first chunk with the header and keep the column names for later
dataChunk <- read.table(conex, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ";")
db_colnames <- colnames(dataChunk)

index <- 0
counter <- 0
total <- 0

repeat {
  total <- total + sum(dataChunk$total)
  counter <- counter + nrow(dataChunk)
  write.table(dataChunk, file = paste0("MY_FILE_new", index), sep = ";", row.names = FALSE)

  if (nrow(dataChunk) != chunkSize) {
    print('linesok')
    break
  }
  index <- index + 1
  print(paste('lines', index * chunkSize))

  # Subsequent chunks carry no header, so reuse the column names captured above
  dataChunk <- read.table(conex, nrows = chunkSize, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)
}
close(conex)

2 answers:

Answer 0 (score: 3)

You are very much on the right track towards the solution.

The approach should be:

1. Read 1 million lines 
2. Write to new files
3. Read next 1 million lines
4. Write to another new file

Let's turn the above logic into a loop along the lines of the OP's attempt:

index <- 0
counter <- 0
total <- 0
chunks <- 500000

# Open the connection once before the loop; 'db' is the path to the big file
con <- file(description = db, open = "r")

repeat{
  # 'db_colnames' is a character vector holding the 70 column names
  dataChunk <- read.table(con, nrows = chunks, header = FALSE, fill = TRUE,
                          sep = ";", col.names = db_colnames)

  # do processing on dataChunk (i.e adding header, converting data type) 

  # Create a new file name and write to it. You can have your own logic for file names 
  write.table(dataChunk, file = paste0("file",index))

  #check if file end has been reached and break from repeat
  if(nrow(dataChunk) < chunks){
    break
  }

  #increment the index to read next chunk
  index = index+1

}

# Close the connection when done
close(con)

Edited: Modified to add another option, reading the file with data.table::fread as requested by the OP.

library(data.table)

index <- 0
counter <- 0
total <- 0
chunks <- 1000000
fileName <- "myfile"

repeat{
  # With fread the file is opened (and re-scanned) in each iteration;
  # 'chunks*index + 1' skips the header row plus all previously read rows
  dataChunk <- fread(input = fileName, nrows = chunks, header = FALSE, fill = TRUE,
                     skip = chunks*index + 1, sep = ";", col.names = db_colnames)

  # do processing on dataChunk (i.e adding header, converting data type) 

  # Create a new file name and write to it. You can have your own logic for file names
  write.table(dataChunk, file = paste0("file",index))

  #check if file end has been reached and break from repeat
  if(nrow(dataChunk) < chunks){
    break
  }

  #increment the index to read next chunk
  index = index+1

}

Note: The above code is only a partial snippet of pseudo code, meant to help the OP. It will not run and produce results on its own.
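
For reference, here is a minimal self-contained sketch of the fread variant that should run as-is, under these assumptions (the file name and output names are illustrative, not from the original post): the input is my_file.txt, it has a single header row, and fields are separated by ";". It writes with data.table::fwrite, which is typically much faster than write.table:

library(data.table)

fileName    <- "my_file.txt"                       # assumed input path
db_colnames <- names(fread(fileName, nrows = 0))   # read only the header row
chunks      <- 1000000
index       <- 0

repeat {
  # skip = chunks*index + 1 jumps over the header plus all rows already read
  dataChunk <- fread(input = fileName, nrows = chunks, header = FALSE,
                     fill = TRUE, sep = ";", skip = chunks * index + 1,
                     col.names = db_colnames)

  # fwrite is data.table's multi-threaded writer
  fwrite(dataChunk, file = paste0("file", index, ".txt"), sep = ";")

  if (nrow(dataChunk) < chunks) break
  index <- index + 1
}

One caveat: because skip makes fread re-scan the file from the start on every call, later iterations get progressively slower on a 14 GB file; the connection-based loop above avoids this. Also, if the row count is an exact multiple of chunks, the final fread call will hit the end of the file and error, so production code should guard that call (e.g. with tryCatch).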

Answer 1 (score: 3)

Not an R-based answer, but in this case I recommend a shell-based solution using GNU split. This should be considerably faster than an R solution.

To split the file into chunks of 10^6 lines each, you would do:

split -l 1000000 my_file.txt 

For more details on split, see e.g. here.
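
If you would rather drive this from within R, the same command can be issued through system(). A minimal sketch, assuming GNU coreutils is available on the PATH and the input file is named my_file.txt (the -d flag, which produces numeric suffixes, is GNU-specific). Note that with this approach the header line ends up only in the first chunk:

# Call GNU split from R; assumes GNU coreutils is on the PATH
# -l 1000000 : 10^6 lines per output file
# -d         : numeric suffixes, giving file00, file01, ...
system("split -l 1000000 -d my_file.txt file")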