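Row-wise operations on a large file

Time: 2017-02-13 09:35:05

Tags: r csv

I have a large CSV file with about 280 columns and 1 billion data points; the file is roughly 20 GB. A sample of the file (with about 7 columns and 4 rows) is shown below: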

SL No.,Roll No.,J_Date,F_Date,S1,S2,S3
1,00123456789,2004/09/11,2009/08/20,43,67,56
2,987654321,2010/04/01,2015/02/20,82,98,76
3,0123459876,2000/06/25,2005/10/02,72,84,02
4,000543216789,1990/08/29,1998/05/31,15,64,82

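Since the file is so large, I have to read it in smaller chunks, with a chunk size that I can specify. But as you can probably see from the sample, "Roll No." must be treated as character rather than numeric. In addition, I need to add up the columns "S1", "S2" and "S3" and write the sum to a new column "MM".

The output for the sample above must look like this: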

SL No.,Roll No.,J_Date,F_Date,S1,S2,S3,MM
1,00123456789,2004/09/11,2009/08/20,43,67,56,166
2,987654321,2010/04/01,2015/02/20,82,98,76,256
3,0123459876,2000/06/25,2005/10/02,72,84,02,158
4,000543216789,1990/08/29,1998/05/31,15,64,82,161

I have asked a similar question before, but I swear I could not get an answer that worked for me. I have referred to the following questions:

R:Loops to process large dataset(GBs) in chunks?

Trimming a huge (3.5 GB) csv file to read into R

How do i read only lines that fulfil a condition from a csv into R?

Reading numbers as strings

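Read numeric input as string R

and many more.

This may be a good time to say that I am a beginner with R, so any help is much appreciated. I have been stuck on this for a long time now.

1 answer:

Answer 0 (score: 1)

I can't say I have done this myself before, but I think this should work.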

library( data.table )

# set the input and output files
input.file <- "foo.csv"
output.file <- sub( "\\.csv$", "_output.csv", input.file )

# get column names by importing the first few lines
column.names <- names( fread( input.file, header = TRUE, nrows = 3L ) )

# write those column names as a line of text (header)
cat( paste( c( column.names, "MM" ), collapse = "," ),
     file = output.file, append = FALSE )
cat( "\n", file = output.file, append = TRUE )

# decide how many rows to read at a time
rows.at.a.time <- 1E4L

# begin looping
start.row <- 1L
while( TRUE ) {

    # read in only the specified lines
    input <- fread( input.file,
                    header = FALSE,
                    skip = start.row,
                    nrows = rows.at.a.time
    )

    # stop looping if no data was read
    if( nrow( input ) == 0L ) break

    # create the "MM" column
    input[ , MM := rowSums( .SD[ , 5:7 ] ) ]

    # append the data to the output file
    fwrite( input,
            file = output.file,
            append = TRUE, col.names = FALSE )

    # bump the `start.row` parameter
    start.row <- start.row + rows.at.a.time

    # stop reading if the end of the file was reached
    if( nrow( input ) < rows.at.a.time ) break

}
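For comparison, here is a minimal sketch of the same chunked read/sum/append idea using a base R file connection instead of repeated fread calls with skip. This is an alternative technique, not the method of the answer above. The function name process_in_chunks and its arguments sum.cols and chunk.size are made up for illustration. It assumes that no field contains an embedded comma or quote (output is written unquoted), and it treats any read error as end of file, which is a simplification that could hide a genuine parsing problem.

```r
# Sketch only: chunked processing through an open file connection.
# `process_in_chunks`, `sum.cols` and `chunk.size` are hypothetical names.
process_in_chunks <- function(input.file, output.file, sum.cols,
                              chunk.size = 10000L) {
  con <- file(input.file, open = "r")
  on.exit(close(con))

  # copy the header line and append the name of the new column
  header <- readLines(con, n = 1L)
  writeLines(paste0(header, ",MM"), output.file)

  repeat {
    # read the next chunk from the current position of the connection;
    # everything is kept as character so leading zeros survive.
    # read.csv raises an error once no lines are left, which is
    # treated here as the end of the file (assumption: no other errors)
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk.size,
               colClasses = "character"),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break

    # convert only the columns being summed and add them element-wise
    chunk$MM <- Reduce(`+`, lapply(chunk[sum.cols], as.numeric))

    # append the processed rows; unquoted output assumes no embedded commas
    write.table(chunk, output.file, append = TRUE, sep = ",",
                quote = FALSE, row.names = FALSE, col.names = FALSE)

    # a short chunk means the end of the file was reached
    if (nrow(chunk) < chunk.size) break
  }
  invisible(output.file)
}
```

A call such as process_in_chunks("foo.csv", "foo_output.csv", sum.cols = 5:7) would mirror the loop above. Because the connection stays open between reads, each chunk starts where the previous one ended, so there is no start.row counter to maintain.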

Update: To preserve the strings, you can import all columns as character by specifying the following in the fread call inside the loop:

colClasses = rep( "character", 280 )
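As a small illustration of what this setting does, the sketch below reads the sample rows from the question with every column as character. It assumes a data.table version whose fread accepts the text argument, and it uses 7 instead of 280 because the sample has only 7 columns; sample.csv and dt are names made up for the example.

```r
library(data.table)

# the sample rows from the question, held in memory for the demonstration
sample.csv <- c(
  "SL No.,Roll No.,J_Date,F_Date,S1,S2,S3",
  "1,00123456789,2004/09/11,2009/08/20,43,67,56",
  "2,987654321,2010/04/01,2015/02/20,82,98,76",
  "3,0123459876,2000/06/25,2005/10/02,72,84,02",
  "4,000543216789,1990/08/29,1998/05/31,15,64,82"
)

# read every column as character so that values such as "00123456789"
# keep their leading zeros
dt <- fread( text = sample.csv,
             header = TRUE,
             colClasses = rep( "character", 7 ) )

# inspect the column that must not be converted to a number
print( dt[[ "Roll No." ]] )
```

With all columns read this way, S1 to S3 are strings too, so they have to be converted before they can be summed.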

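Then, to compute the row sums (since all of your columns are now character), you need to include a conversion at that step. The following replaces the single line in the code above (the one with the same comment):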

# create the "MM" column
input[ , MM := .SD[ , 5:7 ] %>%
           lapply( as.numeric ) %>%
           do.call( what = cbind ) %>%
           rowSums()
       ]

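Where 5:7 is specified here, you can substitute any vector of column references to be passed to rowSums().

Note that if you use the above with the %>% pipe, you need library(magrittr) at the top of your code to load that operator.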
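If you would rather not load magrittr, a pipe-free variant is possible with data.table's .SDcols argument. The sketch below is one way to write it, not part of the original answer. It builds a toy chunk shaped like the sample rows (the V1 to V7 names are what fread assigns with header = FALSE), and sum.cols is a hypothetical variable holding the columns to add. Unlike rowSums(), this version adds the columns element-wise and has no na.rm option.

```r
library(data.table)

# toy chunk shaped like the sample rows, with every column held as character
input <- data.table(
  V1 = c("1", "2", "3", "4"),
  V2 = c("00123456789", "987654321", "0123459876", "000543216789"),
  V3 = c("2004/09/11", "2010/04/01", "2000/06/25", "1990/08/29"),
  V4 = c("2009/08/20", "2015/02/20", "2005/10/02", "1998/05/31"),
  V5 = c("43", "82", "72", "15"),
  V6 = c("67", "98", "84", "64"),
  V7 = c("56", "76", "02", "82")
)

# columns to add up (hypothetical variable name)
sum.cols <- 5:7

# create the "MM" column: .SD holds only the columns named in .SDcols;
# each is converted to numeric and the results are added element-wise
input[ , MM := Reduce( `+`, lapply( .SD, as.numeric ) ), .SDcols = sum.cols ]

print( input$MM )
```

The other columns, including V2, stay as character, so the leading zeros are still there when the chunk is written out.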