I have to read a lot of CSV files (each over 120 MB). I am using a for loop, but it is very, very slow. How can I read the CSVs faster?
My code:
H = data.frame()
for (i in 201:225) {
  for (j in 1996:2007) {
    filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
    x = read.csv(filename, stringsAsFactors = FALSE)
    I = c("051", "041", "044", "54", "V0262")
    temp = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
    H = rbind(H, temp)
  }
}
Each file has a structure like this:
> str(x)
'data.frame': 417691 obs. of 37 variables:
$ YM: int 199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int 1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int 19960505 19970116 19970108 ...
$ C_TYPE: int 19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int 142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr "P2" "P3" "A2"...
$ C_ITEM_2 : chr "R6" "I3" ""...
$ C_ITEM_3 : chr "W2" "" "A2"...
$ C_ITEM_4 : chr "Y1" "O3" ""...
$ F_TYPE: chr "40" "02" "02" "02" ...
$ F_DATE : int 19960415 19961223 19961227 ...
$ T_END_DATE: int NA NA NA ...
$ ID_B : int 19630526 19630526 19630526 ...
$ ID : chr "fff" "fac" "eab"...
$ CAR_NO : chr "B4" "B5" "C1" "B6" ...
$ GE_KI: int 4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr "H10" "A10" "D10" "D10" ...
$ A_1 : chr "0521" "7948" "A310" "A312" ...
$ A_2 : chr "05235" "5354" "" "" ...
$ A_3 : chr "" "" "" "" ...
$ I_O_CE: chr "5210" "" "" "" ...
$ DR_DAY : int 0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int 2 0 0 0 2 2 0 0 0 0 ...
........
Answer 0 (score: 3)
I think the big performance problem here is that you grow the H object iteratively. Every time the object grows, the operating system has to allocate more memory for it, and that takes quite a while. A simple fix is to preallocate H with the correct number of rows; if the row count is not known in advance, you can preallocate a generous amount and resize as needed.
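As a rough, untested sketch of that preallocation idea (it assumes a vector filenames of all file paths and the filter vector I from the question; n_guess is just a guessed upper bound on the number of matching rows):
# read one row to get the column layout and types
template <- read.csv(filenames[1], stringsAsFactors = FALSE, nrows = 1)
n_guess <- 200000                        # assumed upper bound, adjust as needed
H <- template[rep(1, n_guess), ]         # preallocate n_guess rows with the right columns
pos <- 1
for (f in filenames) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  temp <- x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
  n <- nrow(temp)
  if (n > 0) {
    H[pos:(pos + n - 1), ] <- temp       # fill in place instead of rbind()
    pos <- pos + n
  }
}
H <- H[seq_len(pos - 1), ]               # drop the unused preallocated rows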
Alternatively, the following approach avoids the problem I described above altogether:
list_of_files = list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame = do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))
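Since the question also needs the A_1/A_2/A_3 filter, one way to fold that in (an untested sketch) is to filter inside the lapply() call, so that only the matching rows of each file are ever kept in memory:
I <- c("051", "041", "044", "54", "V0262")
read_and_filter <- function(f) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]   # keep only the matching rows
}
big_data_frame = do.call('rbind', lapply(list_of_files, read_and_filter))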
Answer 1 (score: 2)
This may not be the most efficient or most elegant way, but it is what I would do, based on a few assumptions since more information is missing; in particular, I cannot do any testing:
Make sure RSQLite is installed (sqldf would be an option if you have enough memory, but personally I prefer having a "real" database that I can also access with other tools).
# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )
# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
for ( j in 1996:2007 )
{
fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" )
filenames <- c( filenames, fname )
}
}
# extract the DB structure, create empty table
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "all", x, row.names = FALSE )
dbGetQuery( db, "DELETE FROM all" )
# a small table for your selection criteria (build in flexibility for the future)
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ), stringsAsFactors = FALSE )
dbWriteTable( db, "crit", I, row.names = FALSE )
# move your 300 .csv files into that table
# (you probably do that better using the sqlite CLI but more info would be needed)
for( f in filenames )
{
x <- read.csv( f, stringsAsFactors = FALSE )
dbWriteTable( db, "all", x, append = TRUE, row.names = FALSE )
}
# now you can extract the subset in one go
extract <- dbGetQuery( db, "SELECT * FROM all
WHERE A_1 IN (SELECT I FROM crit ) OR
A_2 IN (SELECT I FROM crit ) OR
A_3 IN (SELECT I FROM crit )" )
This is untested but should work (if not, tell me where it stops); it should be faster and should not run into memory problems. But again: no real data, no real solution!
Answer 2 (score: 2)
You can also use the fread() function from the data.table package. It is quite fast compared to read.csv. Also, try looping over list.files().
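A minimal sketch of that suggestion (untested; it assumes all the files sit in one directory and reuses the filter vector I from the question):
library(data.table)
files <- list.files("D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE)
I <- c("051", "041", "044", "54", "V0262")
read_one <- function(f) {
  x <- fread(f)                                   # fread() is much faster than read.csv()
  x[A_1 %in% I | A_2 %in% I | A_3 %in% I]         # data.table syntax: keep only matching rows
}
H <- rbindlist(lapply(files, read_one))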