I have hundreds of matrices that need to be used in R, most of them around 45,000 x 350. What I want to do is find an optimal choice of database software and schema to store the data in and be able to call subsets of the matrices from the database. This needs to extract the data as fast as possible.
As a base, here is code that creates 5 matrices similar to the ones I am dealing with:
if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") }
require("zoo", quietly=TRUE)
numSymbols <- 45000
numVariables <- 5
rDatePattern <- "%d/%m/%Y"
startDate <- "31/12/1982"
endDate <- "30/09/2011"
startYearMonth <- as.yearmon(startDate,format=rDatePattern)
alphaNumeric <- c(1:9,toupper(letters))
numMonths <- (as.yearmon(endDate,format=rDatePattern)-startYearMonth)*12
numValues <- numSymbols*(numMonths+1)
dateVector <- sapply(1:(numMonths+1), function(x) {as.character(format(as.Date(startYearMonth+x*1/12,frac=0)-1,rDatePattern))})
symbolNames <- sapply(1:numSymbols,function(x) {as.character(paste((sample(alphaNumeric,7)),collapse=""))})
for(i in 1:numVariables) {
  assign(paste("Variable",i,sep="_"),
         matrix(sample(c(rnorm(numValues/2),rep(NA,numValues/2))),
                nrow=numSymbols,
                ncol=(numMonths+1),
                dimnames=list(symbolNames,dateVector)))
}
Basically all the matrices have about half their values filled with doubles and the rest NA.
# > ls()[grepl("Variable_",ls())]
# [1] "Variable_1" "Variable_2" "Variable_3" "Variable_4" "Variable_5"
# > dim(Variable_1)
# [1] 45000 346
# > Variable_1[1:10,1:3]
# 31/12/1982 31/01/1983 28/02/1983
# AF3HM5V NA NA -1.15076100366945755
# DL8TVIY NA NA -1.59412257037490046
# JEFDYPO NA NA NA
# 21ZV689 NA NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC NA NA -1.31113363079930023
# XG3ZA1W NA 0.8531074740045220 0.06797987526470438
# W1JCXIE 0.2782029710832690 -1.2668560986048898 NA
# X3RKT2B 1.5220172324681460 -1.0460218516729356 NA
# 3EUB8VN -0.9405417187846803 1.1151437940206490 1.60520458945005262
I would like to be able to store these in a database. An RDBMS would be the default option, but I am willing to look at other options. The biggest part is an optimal solution for fast querying, be it for a whole matrix or a subset of a matrix, e.g. 2000 symbols, 100 dates, etc.
The current solution I have been using is saving each matrix as an RData file, then loading in the whole matrix and truncating it down to size for use. This is really fast, but I feel a database design would be more beneficial in terms of scaling the matrices with respect to symbols + dates and for backing up the data.
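For reference, a minimal sketch of that save-whole / load-whole / subset round-trip in base R (the matrix size, file name, and subset dimensions here are made-up examples, not my production code):

```r
# Build a small labeled matrix, persist it whole, then load and truncate.
m <- matrix(rnorm(1000 * 10), nrow = 1000, ncol = 10,
            dimnames = list(paste0("SYM", 1:1000), paste0("D", 1:10)))
f <- tempfile(fileext = ".RData")
save(m, file = f)                   # persist the whole matrix to disk

loadedName <- load(f)               # load() returns the loaded object's name
sub <- get(loadedName)[1:200, 1:5]  # truncate down to the subset needed
dim(sub)                            # 200 x 5
```

The cost of this approach is that the full matrix is always deserialized, even when only a small subset is needed.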
What I have tried so far in terms of RDBMS options:
A)
- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices for all but value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient)
- Average query for a normal sample into R: 4-8 minutes
B)
- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices for all but value.
- Views added to cache common subsets (dunno if it helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory inefficient)
- Average query for a normal sample into R: 3-5 minutes
C) [should have tried a column-based database here]
- Symbols and dates stored separately and map to row and col numbers only
- Each variable in a separate table with symbols for rows and dates for columns
- Really bad for where data maps to disk when scaling rows and cols.
- Data already in correct format for R
- Average query for a normal sample into R: 1-3 minutes
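To illustrate where the "melting"/pivoting cost in options A and B comes from, here is a toy base-R sketch (the column names and values are assumptions, not my actual schema): every cell of the m x n result, including all the NAs, has to be materialized in memory even though the long format never stored them.

```r
# Long-format rows as they would come back from a Symbol/Date/Value schema
long <- data.frame(
  Symbol = c("AAA", "AAA", "BBB", "BBB"),
  Date   = c("31/12/1982", "31/01/1983", "31/12/1982", "31/01/1983"),
  Value  = c(1.5, -0.3, NA, 0.7),
  stringsAsFactors = FALSE
)

# Pivot back to the wide matrix R wants; with one value per (Symbol, Date)
# cell, tapply with identity just places each value into the right slot.
wide <- tapply(long$Value, list(long$Symbol, long$Date), FUN = identity)
wide["AAA", "31/01/1983"]   # -0.3
```

At 45,000 symbols x 350 dates this pivot allocates the full ~15M-cell matrix regardless of how sparse the long-format result was.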
Compared with the database setups above, loading a whole variable from an RData file takes 5 seconds locally and 20 seconds over the network. All the database times are over the network.
Is there anything I can do to make the database route come anywhere close to the binary file speeds?
Maybe what I need is one of the tabular NoSQL databases?
How does that scale in terms of additional symbols + dates?
Help from anyone who has dealt with something similar would be appreciated.
UPDATE: Thought I would post an update on this. In the end I went with Iterator's suggestion and the data is now hosted in bigmemory memory-mapped files, with RData files for quick drag-and-drop use checks, as well as output to CSV that is pulled into SQL Server for backup. Any database solution was too slow to be used by multiple users. Also, using RODBC against SQL Server is insanely slow; I tried input and output to R via CSV to and from SQL, and that was okay, but pointless.
Also for reference, byte-compiling the load method for bigmemory does have an impact. Here are the results of my load tests for RData vs bigmemory.
workingDirectory <- "/Users/Hans/92 Speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")
LoadVariablesInFolder <- function(folder, sedols, dates) {
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.RData", filesInFolder)]
  filesToLoad <- paste(folder, filesToLoad, sep="/")
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    loadedVar <- load(fToLoad)
    assign(loadedVar, get(loadedVar)[sedols, dates])
    ans <- gc()
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar))
  }
  return(variablesThatWereLoaded)
}
cLoadVariablesInFolder <- cmpfun(LoadVariablesInFolder)
BigMLoadVariablesInFolder <- function(folder, sedols, dates) {
  workD <- getwd()
  setwd(folder)
  filesInFolder <- dir(folder)
  filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.desc", filesInFolder)]
  variablesThatWereLoaded <- c()
  for(fToLoad in filesToLoad) {
    tempVar <- attach.big.matrix(dget(fToLoad))
    loadedVar <- gsub(".*(NVAR_\\d+).*", "\\1", fToLoad, perl=TRUE)
    assign(loadedVar, tempVar[sedols, dates])
    variablesThatWereLoaded <- c(variablesThatWereLoaded, loadedVar)
    rm(list=c(loadedVar, "tempVar"))
    ans <- gc()
  }
  setwd(workD)
  return(variablesThatWereLoaded)
}
cBigMLoadVariablesInFolder <- cmpfun(BigMLoadVariablesInFolder)
testCases <- list(
list(numSedols=1000,numDates=120),
list(numSedols=5000,numDates=120),
list(numSedols=50000,numDates=120),
list(numSedols=1000,numDates=350),
list(numSedols=5000,numDates=350),
list(numSedols=50000,numDates=350))
load(paste(workingDirectory,"dates.cache",sep="/"))
load(paste(workingDirectory,"sedols.cache",sep="/"))
for (testCase in testCases) {
  results <- benchmark(
    LoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    BigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    cBigMLoadVariablesInFolder(folder=workingDirectory, sedols=sedols[1:testCase$numSedols], dates=dates[1:testCase$numDates]),
    columns=c("test", "replications", "elapsed", "relative"),
    order="relative", replications=3)
  cat("Results for testcase:\n")
  print(testCase)
  print(results)
}
Basically, the smaller the subset, the more you gain, because you don't spend time loading in the whole matrix. But loading the whole matrix is slower with bigmemory than with RData; I guess it's the conversion overhead:
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 6.799999999999955 1.000000000000000
# 3 3 14.389999999999986 2.116176470588247
# 1 3 235.639999999999986 34.652941176470819
# 2 3 250.590000000000032 36.851470588235543
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 120
# test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 4 3 7.080000000000155 1.000000000000000
# 3 3 32.730000000000018 4.622881355932105
# 1 3 249.389999999999873 35.224576271185654
# 2 3 254.909999999999854 36.004237288134789
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 120
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 146.3499999999999 1.000000000000000
# 4 3 148.1799999999998 1.012504270584215
# 2 3 238.3200000000002 1.628425008541171
# 1 3 240.4600000000000 1.643047488896482
# Results for testcase:
# $numSedols
# [1] 1000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 83.88000000000011 1.000000000000000
# 4 3 91.71000000000004 1.093347639484977
# 1 3 235.69000000000005 2.809847401049115
# 2 3 240.79999999999973 2.870767763471619
# Results for testcase:
# $numSedols
# [1] 5000
# $numDates
# [1] 350
# test
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 3 3 135.6999999999998 1.000000000000000
# 4 3 155.8900000000003 1.148784082535008
# 2 3 233.3699999999999 1.719749447310245
# 1 3 240.5599999999995 1.772733971997051
# Results for testcase:
# $numSedols
# [1] 50000
# $numDates
# [1] 350
# test
# 2 cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1 LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3 BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# replications elapsed relative
# 2 3 236.5000000000000 1.000000000000000
# 1 3 237.2100000000000 1.003002114164905
# 3 3 388.2900000000000 1.641818181818182
# 4 3 393.6300000000001 1.664397463002115
Answer 0 (score: 4)
I would highly recommend using HDF5. I assume your data is complex enough that various bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 falls just short of the speed of memory-mapped files. See this longer answer to another question for how I compare HDF5 and .RDat files.
Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
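A minimal sketch of that readBin/writeBin idea (the file layout and the readColumn helper are my own illustrative assumptions, not part of the answer): if the matrix is written column-major as raw 8-byte doubles, mirroring R's in-memory layout, then seek() gives you random access to any column without reading the rest of the file.

```r
# Write a matrix as raw doubles (column-major, 8 bytes each),
# then read back a single column by seeking to its byte offset.
nr <- 100; nc <- 20
m <- matrix(as.numeric(1:(nr * nc)), nrow = nr)
f <- tempfile()

con <- file(f, "wb")
writeBin(as.vector(m), con, size = 8)  # column-major, matching R's layout
close(con)

readColumn <- function(path, col, nrow) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (col - 1) * nrow * 8)  # jump straight to the column
  readBin(con, what = "double", n = nrow, size = 8)
}

col5 <- readColumn(f, 5, nr)
identical(col5, m[, 5])   # TRUE
```

Reading a block of symbols for a block of dates would be one seek + readBin per requested column, which is why this can approach memory-mapped speeds for column-shaped subsets.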
Note 1: If your data has just a few types per row, e.g. 1 character and the rest numeric, you can simply create 2 memory-mapped matrices, one for characters, the other for numeric values. This will let you use bigmemory, mwhich, bigtabulate and lots of the other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
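A minimal sketch of that bigmemory workflow (file names and dimensions are hypothetical; requires the bigmemory package): create a file-backed matrix once, then any process can re-attach via the descriptor file and index just the subset it needs.

```r
library(bigmemory)

# Create a file-backed matrix; it lives on disk and is memory-mapped,
# so only the subsets you index actually enter RAM.
bm <- filebacked.big.matrix(
  nrow = 1000, ncol = 12, type = "double",
  backingfile = "var1.bin", descriptorfile = "var1.desc",
  backingpath = tempdir()
)
bm[, ] <- matrix(rnorm(1000 * 12), 1000, 12)

# Later (or from another R process): re-attach via the descriptor
# and pull just the rows/columns you need.
bm2 <- attach.big.matrix(file.path(tempdir(), "var1.desc"))
sub <- bm2[1:50, 1:3]
dim(sub)   # 50 x 3
```

The attach step is near-instant because nothing is deserialized; the OS pages in only the bytes touched by the subscript.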
Update 1: I overlooked the ff package - it is another good option, though it is more complex than bigmemory.
Answer 1 (score: 1)
Maybe the database design of the TSdbi package is inspiring...
For a NoSQL solution, hdf5 might be an option. I do not know much about it, though.
Answer 2 (score: 0)
You could also dig into the usage/internals of sqldf. It seems they have their database work figured out quite well. There are also lots of interesting examples given on their page.