I am analyzing a dataset of 1.14 GB (1,232,705,653 bytes).
Reading the data in R:
trade = read.csv("commodity_trade_statistics_data.csv")
shows that it has 8,225,871 instances and 10 attributes.
Since I intend to analyze the dataset with a data-wrangling web application that caps imports at 100 MB, how can I split the data into files of at most 100 MB each?
The split should be done by rows, and every file should include the header.
Answer 0 (score: 3)
Split the data frame into the desired number of chunks. Here is an example using the built-in mtcars dataset:
no_of_chunks <- 5
f <- ceiling(seq_len(nrow(mtcars)) / nrow(mtcars) * no_of_chunks)  # chunk index (1..no_of_chunks) for each row
res <- split(mtcars, f)                                            # list of data frames, one per chunk
You can then use purrr to save the result as CSV files:
library(purrr)
map2(res, paste0("chunk_", names(res), ".csv"), write.csv)  # write each chunk to its own file
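As an optional tweak (not part of the original answer): write.csv adds row names as an extra unnamed first column by default, so if each chunk should contain exactly the original columns, they can be dropped:
map2(res, paste0("chunk_", names(res), ".csv"),
     ~ write.csv(.x, .y, row.names = FALSE))  # same as above, minus the row-name column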
Edit: for my question, the following script solved the problem:
trade = read.csv("commodity_trade_statistics_data.csv")
no_of_chunks <- 14
f <- ceiling(seq_len(nrow(trade)) / nrow(trade) * no_of_chunks)
res <- split(trade, f)
library(purrr)
map2(res, paste0("chunk_", names(res), ".csv"), write.csv)
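As a side note (not from the original post), no_of_chunks can also be derived from the input file size instead of being hard-coded, under the rough assumption that each chunk's size on disk is proportional to its share of rows:
input_bytes <- file.size("commodity_trade_statistics_data.csv")  # input size on disk
limit_bytes <- 100 * 1000 * 1000                                  # 100 MB per file
no_of_chunks <- ceiling(input_bytes / limit_bytes) + 1            # +1 as a rough safety margin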
Answer 1 (score: 1)
Alternatively, use the tidyverse with readr::read_csv_chunked and readr::write_csv, for example:
require(tidyverse)
inputfile = "./commodity_trade_statistics_data.csv"
chunck.size = 100 * 1000 * 5  # rows per chunk (initial guess; see the estimator below)
proc.chunk = function(df, pos) {
  df %>%
    mutate(row_idx = seq(pos, length.out = nrow(df))) %>%  # optional, preserve row number within original csv
    write_csv(
      paste0("chunk_", floor(pos / chunck.size), ".csv")
    )
}
read_csv_chunked(
  inputfile,
  callback = SideEffectChunkCallback$new(proc.chunk),
  chunk_size = chunck.size,
  progress = TRUE  # optional, show a progress bar
)
Inside readr::read_csv_chunked you can also set the locale argument via readr::locale to define the date-time format and time zone.
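For example, a minimal sketch (the tz and date_format values below are illustrative assumptions, not from the original answer):
read_csv_chunked(
  inputfile,
  callback = SideEffectChunkCallback$new(proc.chunk),
  chunk_size = chunck.size,
  locale = locale(tz = "UTC", date_format = "%Y-%m-%d")  # hypothetical time zone / date format
)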
To "simplify" the estimation of chunck.size, I wrote a rough function based on simple statistics:
calc.chunk.size = function(
  file,
  limit.mb.per.file = 100,   # limit in MB per file
  floor.to.nrow = 10000,     # floor the rows per file to a round multiple for easier numbers
  size.estimate.sigma = 5,   # assume row size is roughly normal and the first n rows are representative;
                             # 3-sigma leaves a 0.1% chance of a file exceeding the limit,
                             # 5-sigma is closer to Six Sigma :)
  sample.first.nrow = 100L
) {
  # requires package "R.utils"
  tot.nrow = R.utils::countLines(file)  # total line count (informational; not used in the estimate below)
  # alternatively, on a POSIX machine (Linux or macOS):
  # tot.nrow = system("wc -l ./commodity_trade_statistics_data.csv", intern = TRUE) %>%
  #   str_extract("\\d+") %>%
  #   as.numeric()
  bytes.per.row = read_lines(file,
                             skip = 1,                  # skip the header, or the first row anyway ...
                             n_max = sample.first.nrow  # sample at most the first N rows
                            ) %>%
    nchar(type = "bytes") %>%
    { list(mean = mean(.), sd = sd(.)) }
  est.bytes.per.row = bytes.per.row$mean + size.estimate.sigma * bytes.per.row$sd
  est.chunck.size = limit.mb.per.file * 1000 * 1000 / est.bytes.per.row
  est.chunck.size = max(c(1, floor(est.chunck.size / floor.to.nrow))) * floor.to.nrow
  est.chunck.size
}
chunck.size = calc.chunk.size(inputfile)
# chunck.size = 540000 for this particular file
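As a final sanity check (a sketch, assuming the chunks were written to the working directory with the chunk_*.csv names used above), you can verify that no file exceeds the 100 MB limit:
chunk_files <- list.files(pattern = "^chunk_.*\\.csv$")  # all emitted chunk files
all(file.size(chunk_files) < 100 * 1000 * 1000)          # TRUE if every chunk is under 100 MB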