rbindlist on a small number of large data.tables

Asked: 2017-01-29 08:27:01

Tags: r memory-management data.table

I have a list of 31 data.tables:

List31 <- lapply(list.files(pattern = "^[0-9]?[0-9]\\.csv$"), fread, na.strings = c("*", "", "NA"))

Read 3562435 rows and 47 (of 47) columns from 0.335 GB file in 00:00:06
Read 4567456 rows and 47 (of 47) columns from 0.412 GB file in 00:00:08
Read 4626490 rows and 47 (of 47) columns from 0.418 GB file in 00:00:07
Read 4705069 rows and 47 (of 47) columns from 0.427 GB file in 00:00:08
Read 4857572 rows and 47 (of 47) columns from 0.450 GB file in 00:00:08
Read 5111313 rows and 47 (of 47) columns from 0.497 GB file in 00:00:09
Read 5275920 rows and 47 (of 47) columns from 0.515 GB file in 00:00:09
Read 5306365 rows and 47 (of 47) columns from 0.518 GB file in 00:00:09
Read 5354296 rows and 47 (of 47) columns from 0.799 GB file in 00:00:15
Read 5499375 rows and 47 (of 47) columns from 0.826 GB file in 00:00:16
Read 5795744 rows and 47 (of 47) columns from 0.897 GB file in 00:00:17
Read 3773471 rows and 47 (of 47) columns from 0.355 GB file in 00:00:06
Read 6059535 rows and 47 (of 47) columns from 0.897 GB file in 00:00:17
Read 5354296 rows and 47 (of 47) columns from 0.794 GB file in 00:00:14
Read 5499375 rows and 47 (of 47) columns from 0.821 GB file in 00:00:14
Read 5795744 rows and 47 (of 47) columns from 0.891 GB file in 00:00:18
Read 6059535 rows and 47 (of 47) columns from 0.891 GB file in 00:00:15
Read 6533038 rows and 47 (of 47) columns from 0.962 GB file in 00:00:19
Read 6975097 rows and 47 (of 47) columns from 1.066 GB file in 00:00:19
Read 7191321 rows and 47 (of 47) columns from 1.099 GB file in 00:00:23
Read 7436464 rows and 47 (of 47) columns from 1.139 GB file in 00:00:21
Read 7698811 rows and 47 (of 47) columns from 1.181 GB file in 00:00:21
Read 4165634 rows and 47 (of 47) columns from 0.393 GB file in 00:00:07
Read 7935997 rows and 47 (of 47) columns from 1.218 GB file in 00:00:25
Read 8226535 rows and 47 (of 47) columns from 1.264 GB file in 00:00:23
Read 4172169 rows and 47 (of 47) columns from 0.393 GB file in 00:00:06
Read 4132907 rows and 47 (of 47) columns from 0.390 GB file in 00:00:06
Read 4114118 rows and 47 (of 47) columns from 0.389 GB file in 00:00:11
Read 4182943 rows and 47 (of 47) columns from 0.396 GB file in 00:00:07
Read 4348032 rows and 47 (of 47) columns from 0.391 GB file in 00:00:07
Read 4486031 rows and 47 (of 47) columns from 0.404 GB file in 00:00:07

Together they occupy 21.6 GB on disk and about 27 GB in memory. My machine has 64 GB of RAM; however, when I try rbindlist(List31, use.names = TRUE, fill = TRUE), I run out of memory:

Error: cannot allocate vector of size 650.0 Mb
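
For context, rbindlist allocates a new result table while the input list is still held in memory, so the session needs room for roughly the inputs plus the output at the same time (more if fill = TRUE adds NA-filled columns or types are promoted). A minimal sketch of that arithmetic on toy data follows; `toy_list` is a hypothetical stand-in for List31, not the real tables.

```r
library(data.table)

# Hypothetical toy tables standing in for the much larger List31.
toy_list <- list(
  data.table(a = 1:3, b = c("x", "y", "z")),
  data.table(a = 4:6, c = c(0.1, 0.2, 0.3))
)

# Total in-memory size of the inputs, in bytes.
input_bytes <- sum(vapply(toy_list,
                          function(DT) as.numeric(object.size(DT)),
                          numeric(1)))

# The combined table is a new allocation; the inputs are not freed by this call.
combined <- rbindlist(toy_list, use.names = TRUE, fill = TRUE)
output_bytes <- as.numeric(object.size(combined))

# Rough lower bound on what must fit in memory while both exist.
peak_bytes <- input_bytes + output_bytes
```

Applied to the figures above, about 27 GB of inputs plus a result of at least similar size is already close to the 64 GB available, before counting temporary allocations.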

What is the best strategy for my machine? (I cannot use AWS or similar services.)

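  1. I tried increasing the swap file. I am on Windows 10 and increased the virtual memory, but it did not seem to help.
  2. Manually add the missing columns, filled with NA, to each element of the list, then use use.names = FALSE, fill = FALSE. I tried the following but got the same error.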

    library(magrittr)
    names_by_class <-
      lapply(List31, function(DT){
        vapply(DT, class, character(1))
      }) %>%
      unlist %>%
      {
        data.table(name = names(.),
                   class = .)
      } %>%
      unique
    
    List31_names <- 
      lapply(List31, names) %>%
      unlist %>% 
      unique
    
    ii <- 1
    add_nas <- function(DT){
      for (j in setdiff(List31_names, names(DT))){
        # Look up class:
        classes <- names_by_class[name == j][["class"]]
        if (length(classes) > 1){
          if (any(classes == "character")){
            classes <- "character"
          } else {
            if (any(classes == "numeric")){
              classes <- "numeric"
            } else {
              if (any(classes == "integer")){
                classes <- "integer"
              } else {
                classes <- "logical"
              }
            }
          }
        }
        switch(classes,
               "integer" = {
                 DT[, (j) := NA_integer_]
               },
               "character" = {
                 DT[, (j) := NA_character_]
               }, 
               "numeric" = {
                 DT[, (j) := NA_real_]
               }, 
               "logical" = {
                 DT[, (j) := NA]
               })
    
      }
      fwrite(DT, paste0(ii, ".csv"))
    }
    for (i in seq_along(List31)){
      add_nas(List31[[i]])
      ii <- ii + 1
    }
    

    Number of columns by class:

    1   integer    32
    2 character    11
    3   numeric     4
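
A minimal sketch of the second approach on toy data, for illustration only. With use.names = FALSE, rbindlist binds columns by position, so this sketch also calls setcolorder to give every table the same column order; that step is an assumption added here, since the code above appends missing columns at the end of each table. `toy_list`, `all_names`, and `na_of` are hypothetical names, and the sketch assumes each column has a single class across all tables.

```r
library(data.table)

# Hypothetical toy tables standing in for two elements of List31.
toy_list <- list(
  data.table(id = 1:2, label = c("a", "b")),
  data.table(id = 3:4, value = c(1.5, 2.5))
)

# Union of column names, in first-seen order.
all_names <- unique(unlist(lapply(toy_list, names)))

# Typed NA for a column, taken from the first table that contains it.
na_of <- function(col) {
  for (DT in toy_list) {
    if (col %in% names(DT)) return(DT[[col]][NA_integer_])
  }
}

for (i in seq_along(toy_list)) {
  DT <- toy_list[[i]]
  for (j in setdiff(all_names, names(DT))) {
    DT[, (j) := na_of(j)]      # add the missing column as a typed NA
  }
  setcolorder(DT, all_names)   # same column order in every table
  toy_list[[i]] <- DT
}

# Columns now line up by position, so no name matching or filling is needed.
combined <- rbindlist(toy_list, use.names = FALSE, fill = FALSE)
```

This only removes the name-matching and filling work; the combined table is still a fresh allocation alongside the inputs.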
    

0 answers