Read all files with a specific extension

Time: 2014-01-05 16:06:57

Tags: r

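I have several csv files stored in the folder "C://Users//Prices//". I want to read these files in R and store them as data frames. I tried a for loop, but it takes several hours to read all the files (I measured it with system.time()).

Is there a way to do this other than with a for loop?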

2 Answers:

Answer 0: (score: 11)

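I'll reiterate that fread is significantly faster, as shown in this Stack Overflow post: Quickly reading very large tables as dataframes in R. In summary, the tests there (on a 51 MB file, 1e6 rows x 6 columns) showed a performance improvement of more than 70% over the best alternative methods, including sqldf, ff, and read.table with and without the optimised settings suggested by @lukeA in their answer. This is backed up in the comments, which report fread loading a 4GB file in under a minute, compared with 15 hours for the base functions.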
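For the question as asked (every csv file in one folder), a minimal sketch might look like the following. It assumes the data.table package is installed; the folder, file names, and columns are made up so that the example runs on its own, and `csv_dir` would be replaced by the real folder.

```r
library(data.table)

# Hypothetical folder; point this at the real location of the csv files.
csv_dir <- file.path(tempdir(), "Prices")
dir.create(csv_dir, showWarnings = FALSE)

# Create two tiny example files so the sketch is self-contained.
fwrite(data.table(symbol = c("A", "B"), price = c(1.5, 2.5)),
       file.path(csv_dir, "day1.csv"))
fwrite(data.table(symbol = c("A", "B"), price = c(1.6, 2.4)),
       file.path(csv_dir, "day2.csv"))

# Find every file whose name ends in .csv (pattern is a regular expression).
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file with fread and keep the results in a named list.
tables <- lapply(csv_files, fread)
names(tables) <- basename(csv_files)

# Optionally stack everything into one table; idcol records the source file.
prices <- rbindlist(tables, idcol = "file")
```

Reading into a list and binding once at the end avoids growing a table inside a loop, which is usually the slow part of a for-loop approach.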

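I ran some tests of my own to compare other ways of reading and combining CSV files. The experimental setup was as follows:

  1. For each run, generate a 4-column CSV file (character x 1, numeric x 3). There are 6 runs, each with a different number of rows, ranging from 10^1, 10^2, ..., 10^6 records per data file.
  2. Import the CSV file into R 10 times, joining the results with rbind or rbindlist to create a single table.
  3. Test read.csv and read.table, with and without optimised arguments such as colClasses, against fread.
  4. Using microbenchmark, repeat every test 10 times (probably unnecessarily high!) and collect the timings for each run.
  5. The results again favour fread with rbindlist over read.table with rbind.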

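    This table shows the median total duration of the 10 file reads for each combination of method and number of rows per file. The first 3 columns are in milliseconds, the last 3 in seconds.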

                  expr       10       100     1000     10000    1e+05     1e+06
    1:           FREAD  3.93704  5.229699 16.80106 0.1470289 1.324394  12.28122
    2:        READ.CSV 12.38413 18.887334 78.68367 0.9609491 8.820387 187.89306
    3:   READ.CSV.PLUS 10.24376 14.480308 60.55098 0.6985101 5.728035  51.83903
    4:      READ.TABLE 12.82230 21.019998 74.49074 0.8096604 9.420266 123.53155
    5: READ.TABLE.PLUS 10.12752 15.622499 57.53279 0.7150357 5.715737  52.91683
    

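    Figure: timing comparison for the 10 runs on the HPC.

    Normalising these values against the fread time shows how many times longer the other methods take in each case: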

                          10      100     1000    10000    1e+05     1e+06
    FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
    READ.CSV        3.145543 3.611553 4.683256 6.535784 6.659941 15.299223
    READ.CSV.PLUS   2.601893 2.768861 3.603998 4.750835 4.325023  4.221001
    READ.TABLE      3.256838 4.019352 4.433693 5.506811 7.112887 10.058576
    READ.TABLE.PLUS 2.572370 2.987266 3.424355 4.863232 4.315737  4.308762
    

    Table of results for 10 `microbenchmark` iterations on the HPC
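The normalisation itself is just a division by the fread value. As a small sketch, using the medians for the 10-row files copied from the first table (the variable names are only illustrative):

```r
# Median timings for the 10-row files, copied from the first table.
medians <- c(FREAD = 3.93704, READ.CSV = 12.38413, READ.CSV.PLUS = 10.24376,
             READ.TABLE = 12.82230, READ.TABLE.PLUS = 10.12752)

# Divide every method's median by fread's median to get the relative cost.
relative <- medians / medians[["FREAD"]]
round(relative, 2)
```

A value of 1 means "as fast as fread"; anything above 1 is how many times longer that method took.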

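    Interestingly, at 1 million rows per file the optimised versions of read.csv and read.table take about 4.2 and 4.3 times as long as fread, while without optimisation this jumps to about 15.3 and 10.1 times as long.

    Note that when I ran this experiment on a powerful laptop rather than on the HPC cluster, the performance gain was somewhat smaller (the optimised base functions took about 1.8 times as long as fread, i.e. roughly 81% slower, compared with about 4.2 times as long on the HPC). This is interesting in itself, but I'm not sure I can explain it!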

                          10      100     1000    10000    1e+05     1e+06
    FREAD           1.000000 1.000000 1.000000 1.000000 1.000000  1.000000
    READ.CSV        2.595057 2.166448 2.115312 3.042585 3.179500  6.694197
    READ.CSV.PLUS   2.238316 1.846175 1.659942 2.361703 2.055851  1.805456
    READ.TABLE      2.191753 2.819338 5.116871 7.593756 9.156118 13.550412
    READ.TABLE.PLUS 2.275799 1.848747 1.827298 2.313686 1.948887  1.832518
    
    Table of results for only 5 `microbenchmark` iterations on my i7 laptop
    

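    Given that you have a fairly large amount of data, I'd suggest not only reading the files with fread but also using the data.table package to process the data afterwards, rather than traditional data.frame operations! I was lucky to learn this lesson at an early stage and would recommend that others follow suit...

    Here is the code used in the tests.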
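As a rough illustration of what processing with data.table looks like, here is a minimal sketch. The table and its column names (`symbol`, `price`) are hypothetical and not taken from the question.

```r
library(data.table)

# Hypothetical price data; the column names are illustrative only.
prices <- data.table(
  symbol = c("A", "A", "B", "B"),
  price  = c(1.5, 1.6, 2.5, 2.4)
)

# Grouped aggregation in data.table syntax: dt[i, j, by].
avg_by_symbol <- prices[, .(mean_price = mean(price)), by = symbol]

# Add a column by reference, without copying the whole table.
prices[, price_cents := price * 100]
```

The `:=` operator modifies the table in place, which is one reason data.table tends to cope well with large inputs.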

    rm(list=ls()) ; gc()
    library(data.table)  ; library(microbenchmark) 
    
    #=============== FUNCTIONS TO BE TESTED ===============
    
    f_FREAD = function(NUM_READS) {
        for (i in 1:NUM_READS) {
            if (i == 1) x = fread("file.csv") else x = rbindlist(list(x, fread("file.csv"))) 
        } 
    }
    f_READ.TABLE = function(NUM_READS) {
        for (i in 1:NUM_READS) {
            if (i == 1) x = read.table("file.csv") else x = rbind(x, read.table("file.csv"))
        }
    }
    f_READ.TABLE.PLUS = function (NUM_READS) {
        for (i in 1:NUM_READS) {
            if (i == 1) {
                x = read.table("file.csv", sep = ",", header = TRUE, comment.char="", colClasses = c("character", "numeric", "numeric", "numeric"))
            } else {
                x = rbind(x, read.table("file.csv", sep = ",", header = TRUE, comment.char="", colClasses = c("character", "numeric", "numeric", "numeric")))
            }
        }       
    }
    f_READ.CSV = function(NUM_READS) {
        for (i in 1:NUM_READS) {
            if (i == 1) x = read.csv("file.csv") else x = rbind(x, read.csv("file.csv"))
        }
    }
    f_READ.CSV.PLUS = function (NUM_READS) {
        for (i in 1:NUM_READS) {
            if (i == 1) {
                x = read.csv("file.csv", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric"))
            } else {
                x = rbind(x, read.csv("file.csv", comment.char="", header = TRUE, colClasses = c("character", "numeric", "numeric", "numeric")))
            }
        }       
    }
    
    #=============== MAIN EXPERIMENTAL LOOP ===============
    for (i in 1:6)
    {
        NUM_ROWS = (10^i)       # the loop allows us to test the performance over varying numbers of rows
        NUM_READS = 10
        NUM_ITERATIONS = 10     # microbenchmark repetitions per method (10 on the HPC, 5 on the laptop)
    
        # create a test data.table with the specified number of rows and write it to file
        dt = data.table(
            col1 = sample(letters[],NUM_ROWS,replace=TRUE),
            col2 = rnorm(NUM_ROWS),
            col3 = rnorm(NUM_ROWS),
            col4 = rnorm(NUM_ROWS)
        )
        write.csv(dt, "file.csv", row.names=FALSE)
    
        # run the imports for each method, recording results with microbenchmark
        results = microbenchmark(
                    FREAD = f_FREAD(NUM_READS), 
                    READ.TABLE = f_READ.TABLE(NUM_READS),
                    READ.TABLE.PLUS = f_READ.TABLE.PLUS(NUM_READS),
                    READ.CSV = f_READ.CSV(NUM_READS), 
                    READ.CSV.PLUS = f_READ.CSV.PLUS(NUM_READS), 
                    times = NUM_ITERATIONS)
        results = data.table(NUM_ROWS = NUM_ROWS, results)
        if (i == 1) results.all = results else results.all = rbindlist(list(results.all, results))      
    }
    
    results.all[,time:=time/1000000000]     # convert from nanoseconds to seconds
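The listing stops at the raw timings. One way to get from `results.all` to a table of medians per method and file size is sketched below; the `results.all` here is a toy stand-in with made-up values so that the snippet runs on its own, and the summary step is an assumption about how such a table could be built, not the code used for the tables in this answer.

```r
library(data.table)

# Toy stand-in for results.all (made-up timings, in seconds).
results.all <- data.table(
  NUM_ROWS = rep(c(10, 100), each = 4),
  expr     = rep(c("FREAD", "FREAD", "READ.CSV", "READ.CSV"), times = 2),
  time     = c(0.004, 0.006, 0.012, 0.014, 0.005, 0.007, 0.018, 0.020)
)

# Median time for each method and file size.
medians <- results.all[, .(median_time = median(time)), by = .(expr, NUM_ROWS)]

# Reshape to one row per method and one column per file size.
wide <- dcast(medians, expr ~ NUM_ROWS, value.var = "median_time")
```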
    

Answer 1: (score: 2)

Speed up your read.table command by

  • predefining colClasses=c("numeric", "factor", ...)
  • setting stringsAsFactors=FALSE
  • disabling csv comments with comment.char=""

via http://www.cerebralmastication.com/2009/11/loading-big-data-into-r/
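Put together, a call with these settings might look like the sketch below. The file and its two columns are made up for the example; the column classes have to match the real data.

```r
# Write a small example file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c("a", "b"), value = c(1.5, 2.5)),
          csv_path, row.names = FALSE)

x <- read.table(
  csv_path,
  sep = ",", header = TRUE,
  colClasses = c("character", "numeric"),  # skip type guessing
  stringsAsFactors = FALSE,                # keep strings as character vectors
  comment.char = ""                        # do not scan for comment markers
)
str(x)
```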