I have 31 data.tables:
List31 <- lapply(list.files(pattern = "^[0-9]?[0-9]\\.csv$"), fread, na.strings = c("*", "", "NA"))
Read 3562435 rows and 47 (of 47) columns from 0.335 GB file in 00:00:06
Read 4567456 rows and 47 (of 47) columns from 0.412 GB file in 00:00:08
Read 4626490 rows and 47 (of 47) columns from 0.418 GB file in 00:00:07
Read 4705069 rows and 47 (of 47) columns from 0.427 GB file in 00:00:08
Read 4857572 rows and 47 (of 47) columns from 0.450 GB file in 00:00:08
Read 5111313 rows and 47 (of 47) columns from 0.497 GB file in 00:00:09
Read 5275920 rows and 47 (of 47) columns from 0.515 GB file in 00:00:09
Read 5306365 rows and 47 (of 47) columns from 0.518 GB file in 00:00:09
Read 5354296 rows and 47 (of 47) columns from 0.799 GB file in 00:00:15
Read 5499375 rows and 47 (of 47) columns from 0.826 GB file in 00:00:16
Read 5795744 rows and 47 (of 47) columns from 0.897 GB file in 00:00:17
Read 3773471 rows and 47 (of 47) columns from 0.355 GB file in 00:00:06
Read 6059535 rows and 47 (of 47) columns from 0.897 GB file in 00:00:17
Read 5354296 rows and 47 (of 47) columns from 0.794 GB file in 00:00:14
Read 5499375 rows and 47 (of 47) columns from 0.821 GB file in 00:00:14
Read 5795744 rows and 47 (of 47) columns from 0.891 GB file in 00:00:18
Read 6059535 rows and 47 (of 47) columns from 0.891 GB file in 00:00:15
Read 6533038 rows and 47 (of 47) columns from 0.962 GB file in 00:00:19
Read 6975097 rows and 47 (of 47) columns from 1.066 GB file in 00:00:19
Read 7191321 rows and 47 (of 47) columns from 1.099 GB file in 00:00:23
Read 7436464 rows and 47 (of 47) columns from 1.139 GB file in 00:00:21
Read 7698811 rows and 47 (of 47) columns from 1.181 GB file in 00:00:21
Read 4165634 rows and 47 (of 47) columns from 0.393 GB file in 00:00:07
Read 7935997 rows and 47 (of 47) columns from 1.218 GB file in 00:00:25
Read 8226535 rows and 47 (of 47) columns from 1.264 GB file in 00:00:23
Read 4172169 rows and 47 (of 47) columns from 0.393 GB file in 00:00:06
Read 4132907 rows and 47 (of 47) columns from 0.390 GB file in 00:00:06
Read 4114118 rows and 47 (of 47) columns from 0.389 GB file in 00:00:11
Read 4182943 rows and 47 (of 47) columns from 0.396 GB file in 00:00:07
Read 4348032 rows and 47 (of 47) columns from 0.391 GB file in 00:00:07
Read 4486031 rows and 47 (of 47) columns from 0.404 GB file in 00:00:07
Together they occupy 21.6 GB on disk and roughly 27 GB in memory. My machine has 64 GB of RAM; however, when I try rbindlist(List31, use.names = TRUE, fill = TRUE), I run out of memory:
Error: cannot allocate vector of size 650.0 Mb
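What is the best strategy for my machine? (I cannot use AWS or similar services.)
One idea was to add the missing columns to each element of the list manually, filled with NA, and then call rbindlist with use.names = FALSE, fill = FALSE. I tried the following but got the same error.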
library(magrittr)

# Every (column name, class) pair seen across the 31 tables
names_by_class <-
  lapply(List31, function(DT){
    vapply(DT, class, character(1))
  }) %>%
  unlist %>%
  {
    data.table(name = names(.),
               class = .)
  } %>%
  unique

# Union of all column names across the 31 tables
List31_names <-
  lapply(List31, names) %>%
  unlist %>%
  unique

# Index of the table being processed; read by add_nas() to name its output file
ii <- 1

add_nas <- function(DT){
  for (j in setdiff(List31_names, names(DT))){
    # Look up class:
    classes <- names_by_class[name == j][["class"]]
    # If the column has several classes across tables, pick the most general:
    # character > numeric > integer > logical
    if (length(classes) > 1){
      if (any(classes == "character")){
        classes <- "character"
      } else {
        if (any(classes == "numeric")){
          classes <- "numeric"
        } else {
          if (any(classes == "integer")){
            classes <- "integer"
          } else {
            classes <- "logical"
          }
        }
      }
    }
    # Add the missing column by reference as a typed NA
    switch(classes,
           "integer" = {
             DT[, (j) := NA_integer_]
           },
           "character" = {
             DT[, (j) := NA_character_]
           },
           "numeric" = {
             DT[, (j) := NA_real_]
           },
           "logical" = {
             DT[, (j) := NA]
           })
  }
  # Write the padded table back to disk as <ii>.csv
  fwrite(DT, paste0(ii, ".csv"))
}

for (i in seq_along(List31)){
  add_nas(List31[[i]])
  ii <- ii + 1
}
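The code above only pads the tables; the positional bind described earlier (use.names = FALSE, fill = FALSE) is not shown. A minimal sketch of that step on toy data follows. toy_list, all_names, and aligned are hypothetical names used only for illustration, and the missing columns are filled with a plain logical NA for brevity rather than a typed NA. Because add_nas appends new columns at the end, the column order can differ between tables, so the sketch reorders each table with setcolorder before binding by position. This shows only the mechanics and is not claimed to avoid the allocation error.

```r
library(data.table)

# Toy stand-ins for the 31 tables (hypothetical data)
toy_list <- list(
  data.table(a = 1:2, b = c("x", "y")),
  data.table(b = "z", c = 3.5)
)

# Union of all column names, in first-seen order
all_names <- unique(unlist(lapply(toy_list, names)))

aligned <- lapply(toy_list, function(DT) {
  missing_cols <- setdiff(all_names, names(DT))
  # Add each missing column as NA (logical here, for brevity)
  if (length(missing_cols)) DT[, (missing_cols) := NA]
  # Put the columns in one common order so a positional bind is valid
  setcolorder(DT, all_names)
  DT
})

# Bind by position: no name matching, no filling
combined <- rbindlist(aligned, use.names = FALSE, fill = FALSE)
print(combined)
```

With use.names = FALSE the columns are matched purely by position, which is why the common ordering has to be established first.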
Number of columns of each type:
1 integer 32
2 character 11
3 numeric 4