Question

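I recently began using RODBC to connect to PostgreSQL, because I couldn't get RPostgreSQL to compile and run in Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row data frame):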

library(RODBC)
con <- odbcConnect("PostgreSQL84")

#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))

user  system elapsed
275.34  369.86 1979.59 

odbcEndTran(con, commit = TRUE)
odbcCloseAll()

For the same ~6.1M row data frame using RPostgreSQL (under 32-bit):

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))

user  system elapsed 
467.57   56.62  668.29 

dbDisconnect(con)

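So, in this test, RPostgreSQL is about 3X as fast as RODBC at writing tables (668 vs 1980 seconds elapsed). This performance ratio seems to stay more or less constant regardless of the number of rows in the data frame (but the number of columns has far less effect). I notice that RPostgreSQL uses something like COPY <table> FROM STDIN, while RODBC issues a bunch of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate.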
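To make the difference between the two write paths concrete, here is a minimal sketch that only builds the text each approach would send, without touching a database. The helper names (insert_statements, copy_payload) and the toy data frame are hypothetical, and the exact SQL each package emits is an assumption based on the observation above, not a dump of what the drivers actually send.

```r
# Toy stand-in for z; the column names are made up for illustration.
z_small <- data.frame(id = 1:3, name = c("a", "b", "c"),
                      stringsAsFactors = FALSE)

# Format one value as an SQL literal (numbers bare, strings single-quoted).
sql_literal <- function(v) {
  if (is.numeric(v)) as.character(v) else paste0("'", gsub("'", "''", v), "'")
}

# Row-wise style: one INSERT statement per row, so the server has to
# process nrow(dat) separate statements.
insert_statements <- function(dat, table) {
  cols <- paste(names(dat), collapse = ", ")
  vapply(seq_len(nrow(dat)), function(i) {
    vals <- vapply(dat, function(col) sql_literal(col[i]), character(1))
    sprintf("INSERT INTO %s (%s) VALUES (%s)",
            table, cols, paste(vals, collapse = ", "))
  }, character(1))
}

# Bulk style: a single COPY statement followed by the rows as CSV text.
copy_payload <- function(dat, table) {
  statement <- sprintf("COPY %s (%s) FROM STDIN WITH CSV",
                       table, paste(names(dat), collapse = ", "))
  rows <- utils::capture.output(
    write.table(dat, sep = ",", row.names = FALSE, col.names = FALSE,
                qmethod = "double")
  )
  list(statement = statement, rows = rows)
}

inserts <- insert_statements(z_small, "t")
bulk <- copy_payload(z_small, "t")
print(inserts)
print(bulk)
```

With ~6.1M rows, the first style produces ~6.1M statements while the second still produces one statement plus a data stream, which is consistent with the timing gap reported here.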

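I need to do this kind of data frame copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?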

Answer 1

There doesn't seem to be an immediate answer to this, so I'll post a kludgy workaround in case it is helpful to anyone.

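Sharpie is correct: COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) data frame took 960 seconds (elapsed) via sqlSave vs 69 seconds using the function below. I wouldn't have expected this, since the data are written once to disk and then again to the database.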

library(RODBC)
con <- odbcConnect("PostgreSQL90")

#create the table
createTab <- function(dat, datname) {

  #make an empty table, saving the trouble of making it by hand
  res <- sqlSave(con, dat[1, ], datname)
  res <- sqlQuery(con, paste("TRUNCATE TABLE",datname))

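  #write the dataframe; write.csv includes row names by default,
  #which lines up with the rownames column sqlSave adds by default
  outfile <- paste(datname, ".csv", sep = "")
  write.csv(dat, outfile)
  gc()   # don't know why, but memory is 
         # not released after writing large csv?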

  # now copy the data into the table.  If this doesn't work,
  # be sure that postgres has read permissions for the path
  sqlQuery(con,  
  paste("COPY ", datname, " FROM '", 
    getwd(), "/", datname, 
    ".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;", 
    sep=""))

  unlink(outfile)
}

odbcClose(con)
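Because createTab builds the empty table through sqlSave, the int8-for-integers behaviour noted in the question would presumably carry over to tables created this way. One possible tweak, sketched below, is to pass sqlSave's varTypes argument (a named character vector of DBMS column types) so that integer columns are created as int4. The helper int4_var_types is hypothetical, not part of RODBC, and I have not verified how the PostgreSQL ODBC driver treats it in every case.

```r
# Hypothetical helper (not part of RODBC): build a varTypes vector that
# requests int4 for every integer column of a data frame.
int4_var_types <- function(dat) {
  is_int <- vapply(dat, is.integer, logical(1))
  stats::setNames(rep("int4", sum(is_int)), names(dat)[is_int])
}

# Toy data frame with made-up columns: two integer, one double.
dat <- data.frame(id = 1:3, score = c(1.5, 2.5, 3.5), n = c(10L, 20L, 30L))
types <- int4_var_types(dat)
print(types)

# Assumed usage inside createTab (needs a live connection, so it is
# left commented out here):
# res <- sqlSave(con, dat[1, ], datname, varTypes = int4_var_types(dat))
```

Columns not named in varTypes are left to sqlSave's usual type selection, so the double column above is unaffected.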
