I know how to open a connection and read the data in chunks with read.table [EDIT: fread does not allow connections], drop some rows, and collect the resulting data in a list, chunk by chunk. But is there any other way to optimize this, so that the chunks can be read with fread and processed at the same time?
I am working on Windows.
What I have gathered online so far: I could use Cygwin's split to break my large csv file into several smaller csv files, and then use parLapply to fread all of them.
Do you have a better idea?
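For reference, a minimal sketch of the connection-based read.table loop described in the first sentence; the file name, chunk size, and the row filter are placeholders, not an actual solution:
# Read a csv in chunks over an open connection with read.table.
# "big.csv", chunk_size and the filter on column 2 are placeholders.
con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)              # consume the header line
chunk_size <- 10000
chunks <- list()
i <- 1
repeat {
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, sep = ",", stringsAsFactors = FALSE),
    error = function(e) NULL)                # read.table errors when no lines remain
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[i]] <- chunk[!is.na(chunk[[2]]), ] # drop some rows, e.g. missing values
  i <- i + 1
}
close(con)
result <- do.call(rbind, chunks)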
Answer 0 (score: 2)
Here is an attempt to parallelize fread calls over chunks of data. The solution borrows many of its elements from TryCatch with parLapply (Parallel package) in R.
require(data.table)
require(dplyr)
require(parallel)
gc()
#=========================================================================
# generating test data
#=========================================================================
set.seed(1)
m <- matrix(rnorm(1e5), ncol = 2)  # 50,000 rows, 2 columns
csv <- data.frame(x = 1:1e2, m)    # x (length 100) is recycled to 50,000 rows
names(csv) <- c(letters[1:3])
head(csv)
write.csv(csv, "test.csv")         # write.csv adds a row-name column, so the file has 4 columns
#=========================================================================
# defining function to read chunks of data with fread: fread_by_chunks
#=========================================================================
fread_by_chunks <- function(filepath, counter, ChunkSize, ...) {
  chunk <- as.character({(counter - 1) / ChunkSize} + 1)
  print(paste0("Working on chunk ", chunk, "..."))
  DT <- tryCatch(fread(filepath,
                       skip = counter,
                       nrows = ChunkSize,
                       ...),
                 error = function(e) message(conditionMessage(e)))
  if (!inherits(DT, "data.table")) {
    # fread errored (e.g. the chunk starts past the end of the file)
    DT <- data.table(chunk = chunk, is.empty = "YES")
  } else if (nrow(DT) == 0) {
    # Just in case a chunk comes back empty even though fread did not error
    DT <- data.table(chunk = chunk, is.empty = "YES")
  } else {
    # Apply your row filter here using column indexes, e.g. DT[DT[[1]] > 0]:
    # the columns are not named, and the automatic names (V1, V2, ...) do not work.
    DT[, chunk := chunk]
    DT[, is.empty := "NO"]
  }
  return(DT)
}
#=========================================================================
# testing fread_by_chunks
#=========================================================================
ChunkSize = 1000
n_rows = 60000 # test.csv has 50e3 lines; we deliberately guess high to check that the code survives reading past the end of the file.
## You have to guess how many rows the file has. Guess high so that every line
## gets read: when your guess exceeds the actual row count, the extra chunks come
## back with is.empty == "YES", and you simply delete those rows afterwards. If no
## such rows appear, you cannot be sure you have read the whole csv file.
counter <- c(0, seq(ChunkSize, n_rows, ChunkSize)) + 1
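If you would rather not guess, a small helper (count_lines below is my addition, not part of the answer's code) can count the lines exactly while keeping memory bounded:
# Count the lines of a file in blocks, so the whole file is never in memory.
count_lines <- function(filepath, block = 1e5L) {
  con <- file(filepath, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    got <- length(readLines(con, n = block))
    n <- n + got
    if (got < block) break
  }
  n
}
# n_rows <- count_lines("test.csv") - 1L  # minus the header line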
start_time <- Sys.time()
test <- lapply(counter, function(x) {
  fread_by_chunks(filepath = "test.csv", counter = x, ChunkSize = ChunkSize,
                  header = F, fill = T, blank.lines.skip = T, select = c(1, 2, 4))
})
Sys.time() - start_time
##Time difference of 0.2528741 secs
# binding chunks
test <- bind_rows(test)
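As noted above, chunks past the end of the file come back as sentinel rows; a short cleanup sketch (the column names chunk and is.empty come from fread_by_chunks):
# Drop the sentinel rows, then the helper columns
test <- test[test$is.empty == "NO", ]
test$chunk <- test$is.empty <- NULL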
#=========================================================================
# parallelizing fread_by_chunks
#=========================================================================
no_cores <- detectCores() - 1 # 3 cores, 2.8 GHz
cl <- makeCluster(no_cores)
clusterExport(cl, c("ChunkSize", "counter", "fread_by_chunks", "n_rows")) # no need to export "data.table": the package is loaded on the workers below
clusterEvalQ(cl, library(data.table))
start_time <- Sys.time()
test <- parLapply(cl, counter, function(x) {
  fread_by_chunks(filepath = "test.csv", counter = x, ChunkSize = 1000,
                  header = F, fill = T, blank.lines.skip = T, select = c(1, 2, 4))
})
Sys.time() - start_time
##Time difference of 0.162251 secs
stopCluster(cl)
test <- bind_rows(test)
# Just calling fread without chunks, for comparison. It obviously takes far less time, but it only works because all the data fits in memory.
start_time <- Sys.time()
test <- fread("test.csv",
              skip = 0,
              header = F,
              fill = T,
              blank.lines.skip = T,
              select = c(1, 2, 4))
Sys.time() - start_time
#Time difference of 0.006005049 secs
Answer 1 (score: 1)
I like your solution and the timing tests, but I wish I understood the problem better. Is the problem that you don't have enough memory to read the whole file, or that you want to read and process the data faster by parallelizing?
If the problem is file size > memory, but the subset of rows and columns you want would fit in memory, then I suggest using awk to produce a smaller csv containing only those rows and columns, and reading that in. awk processes one line at a time, so memory will not be an issue. Here is example awk code that skips blank lines and writes columns 1, 2 and 4 to a smaller .csv:
awk -F',' 'BEGIN{OFS=","}{if($1!="")print $1,$2,$4}' big.csv > smaller.csv
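If you have awk available anyway (e.g. through Cygwin, as in the question), fread can also run the filter for you through its cmd argument, so the smaller file never needs to hit disk. A sketch, assuming awk is on the PATH:
# fread's cmd argument pipes the output of a shell command straight into R
DT <- data.table::fread(
  cmd = "awk -F',' 'BEGIN{OFS=\",\"}{if($1!=\"\")print $1,$2,$4}' big.csv")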
If the problem is speed, I suspect the fastest option is to read the file in once and then parallelize the processing with, e.g., parLapply, or more simply mclapply (though mclapply relies on forking, which is unavailable on Windows).
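A rough sketch of that read-once-then-parallelize pattern; the per-chunk processing is a placeholder, and note that parLapply serializes the data to each worker, which costs memory and time:
library(data.table)
library(parallel)
DT <- fread("big.csv")
# Split the row indices into as many chunks as workers
idx <- split(seq_len(nrow(DT)), cut(seq_len(nrow(DT)), 4, labels = FALSE))
cl <- makeCluster(4)
clusterEvalQ(cl, library(data.table))
res <- parLapply(cl, idx, function(i, d) {
  sub <- d[i]           # this chunk's rows
  sub[!is.na(sub[[1]])] # placeholder processing: drop rows with a missing first column
}, d = DT)
stopCluster(cl)
out <- rbindlist(res)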