I have to read a lot of CSV files (each over 120 MB). I am using a for loop, but it is very, very slow. How can I read the CSVs faster?
My code:
H = data.frame()
for (i in 201:225) {
  for (j in 1996:2007) {
    filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
    x = read.csv(filename, stringsAsFactors = FALSE)
    I = c("051", "041", "044", "54", "V0262")
    temp = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
    H = rbind(H, temp)
  }
}
Each file has a structure like this:
> str(x)
'data.frame': 417691 obs. of 37 variables:
$ YM: int 199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int 1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int 19960505 19970116 19970108 ...
$ C_TYPE: int 19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int 142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr "P2" "P3" "A2"...
$ C_ITEM_2 : chr "R6" "I3" ""...
$ C_ITEM_3 : chr "W2" "" "A2"...
$ C_ITEM_4 : chr "Y1" "O3" ""...
$ F_TYPE: chr "40" "02" "02" "02" ...
$ F_DATE : int 19960415 19961223 19961227 ...
$ T_END_DATE: int NA NA NA ...
$ ID_B : int 19630526 19630526 19630526 ...
$ ID : chr "fff" "fac" "eab"...
$ CAR_NO : chr "B4" "B5" "C1" "B6" ...
$ GE_KI: int 4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr "H10" "A10" "D10" "D10" ...
$ A_1 : chr "0521" "7948" "A310" "A312" ...
$ A_2 : chr "05235" "5354" "" "" ...
$ A_3 : chr "" "" "" "" ...
$ I_O_CE: chr "5210" "" "" "" ...
$ DR_DAY : int 0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int 2 0 0 0 2 2 0 0 0 0 ...
........
Answer 0 (score: 3)
I think the big performance problem here is that you grow the H object iteratively. Every time the object grows, the operating system has to allocate more memory for it, and that takes quite a while. A simple fix is to preallocate H with the correct number of rows; if the row count is not known in advance, you can preallocate a generous amount and resize as needed.
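As a rough, untested sketch of that preallocation idea (it assumes a vector filenames of all file paths and the filter vector I from the question; n_guess is just a guessed upper bound on the number of matching rows):
# read one row to get the column layout and types
template <- read.csv(filenames[1], stringsAsFactors = FALSE, nrows = 1)
n_guess <- 200000                        # assumed upper bound, adjust as needed
H <- template[rep(1, n_guess), ]         # preallocate n_guess rows with the right columns
pos <- 1
for (f in filenames) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  temp <- x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
  n <- nrow(temp)
  if (n > 0) {
    H[pos:(pos + n - 1), ] <- temp       # fill in place instead of rbind()
    pos <- pos + n
  }
}
H <- H[seq_len(pos - 1), ]               # drop the unused preallocated rows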
Alternatively, the following approach avoids the problem I described above altogether:
list_of_files = list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame = do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))
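Since the question also needs the A_1/A_2/A_3 filter, one way to fold that in (an untested sketch) is to filter inside the lapply() call, so that only the matching rows of each file are ever kept in memory:
I <- c("051", "041", "044", "54", "V0262")
read_and_filter <- function(f) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]   # keep only the matching rows
}
big_data_frame = do.call('rbind', lapply(list_of_files, read_and_filter))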
Answer 1 (score: 2)
This may not be the most efficient or most elegant way, but it is what I would do, based on a few assumptions since more information is missing; in particular, I cannot do any testing:
Make sure RSQLite is installed (sqldf would be an option if you have enough memory, but personally I prefer having a "real" database that I can also access with other tools).
# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )
# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
for ( j in 1996:2007 )
{
fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" )
filenames <- c( filenames, fname )
}
}
# extract the DB structure, create empty table
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "all", x, row.names = FALSE )
dbGetQuery( db, "DELETE FROM all" )
# a small table for your selection criteria (build in flexibility for the future)
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ), stringsAsFactors = FALSE )
dbWriteTable( db, "crit", I, row.names = FALSE )
# move your 300 .csv files into that table
# (you probably do that better using the sqlite CLI but more info would be needed)
for( f in filenames )
{
x <- read.csv( f, stringsAsFactors = FALSE )
dbWriteTable( db, "all", x, append = TRUE, row.names = FALSE )
}
# now you can extract the subset in one go
extract <- dbGetQuery( db, "SELECT * FROM all
WHERE A_1 IN (SELECT I FROM crit ) OR
A_2 IN (SELECT I FROM crit ) OR
A_3 IN (SELECT I FROM crit )" )
This is untested but should work (if not, tell me where it stops); it should be faster and should not run into memory problems. But again: no real data, no real solution!
Answer 2 (score: 2)
You can also use the fread() function from the data.table package. It is quite fast compared to read.csv. Also, try looping over list.files().
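A minimal sketch of that suggestion (untested; it assumes all the files sit in one directory and reuses the filter vector I from the question):
library(data.table)
files <- list.files("D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE)
I <- c("051", "041", "044", "54", "V0262")
read_one <- function(f) {
  x <- fread(f)                                   # fread() is much faster than read.csv()
  x[A_1 %in% I | A_2 %in% I | A_3 %in% I]         # data.table syntax: keep only matching rows
}
H <- rbindlist(lapply(files, read_one))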