Using long vectors in R with glmnet and dotCall64

Time: 2019-02-08 20:33:21

Tags: r

I am using glmnet and glmnetcr to fit ordinal regression models.

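Unfortunately, my model matrix is ~640000 * 5000. That is about 3.2 billion elements, more than the largest value a 32-bit integer can hold (2^31 - 1), so I run into the same problem others have described: R vector size limit: "long vectors (argument 5) are not supported in .C"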
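To make the size problem concrete, here is a minimal sketch of the arithmetic using only base R. The dimensions are the approximate ones quoted above; the memory figure assumes a dense double matrix at 8 bytes per element and ignores any copies R makes along the way.

```r
# Approximate dimensions quoted in the question.
n_obs  <- 640000
n_vars <- 5000

# Work in doubles: integer multiplication would overflow to NA here.
n_elements <- as.numeric(n_obs) * as.numeric(n_vars)

# Largest value a 32-bit signed integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

# A vector with more elements than int_max is a "long vector" in R.
is_long_vector <- n_elements > int_max

# Half of the data stays under the limit, which matches the observation
# that the fit runs when only half of the rows are used.
half_fits <- (n_elements / 2) <= int_max

# Rough memory footprint of the dense double matrix, in GiB.
size_gib <- n_elements * 8 / 2^30

cat("elements:       ", format(n_elements, big.mark = ","), "\n")
cat("32-bit int max: ", format(int_max, big.mark = ","), "\n")
cat("long vector:    ", is_long_vector, "\n")
cat("half fits:      ", half_fits, "\n")
cat("approx. GiB:    ", round(size_gib, 1), "\n")
```

The .C and .Fortran interfaces refuse long vectors, which is why the full matrix triggers the error while half of it does not.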

If I use only half of the data, I can run it without problems on a local server with enough memory.

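I have tried to implement the "solution" from the post above using the dotCall64 package. I replaced the .Fortran call with .C64 and specified a data type for each argument. However, every time I run the code I get either nonsensical lambda values (9.9e35) or the following segfault:

*** caught segfault *** address 0x1511aaeb0, cause 'memory not mapped'

Which of the two I get, and the exact address, differ from run to run, so I assume I did something wrong when implementing this solution.

Here is the code I have so far in the function lognet() (this is the function that glmnetcr and glmnet ultimately call, and it passes the variables to the Fortran code).

Original code in lognet()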

.Fortran("lognet", parm = alpha, nobs, nvars, nc, as.double(x), 
        y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh, 
        isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam * 
            nc), ca = double(nx * nlam * nc), ia = integer(nx), 
        nin = integer(nlam), nulldev = double(1), dev = double(nlam), 
        alm = double(nlam), nlp = integer(1), jerr = integer(1), 
        PACKAGE = "glmnet")

Modified code in lognet()

.C64("lognet", SIGNATURE = c("double","int",   "int",   "int",   "int64",                         
                             "double","double","int",   "double","double",
                             "int",   "int",   "int",   "double","double",
                             "double","int",   "int",   "int",   "int",
                             "int",   "double","double","int",   "int",
                             "double","double","double","int",    "int"),
                parm = alpha, nobs, nvars, nc, as.double(x), 
                y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh, 
                isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam * nc), ca = double(nx * nlam * nc), ia = integer(nx), 
                nin = integer(nlam), nulldev = double(1), dev = double(nlam), 
                alm = double(nlam), nlp = integer(1), jerr = integer(1), 
                PACKAGE = "glmnet")

Toy example (the data are much smaller than the real data)

library(glmnetcr)
library(dotCall64)

x1 <- cbind(c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1),c(0,0,0,1,0,1,1,1,0,0,0,0,0,1,1,1),c(0,0,1,0,1,0,1,1,0,0,0,0,1,0,1,1),c(0,1,0,0,1,1,0,1,0,0,0,0,1,1,0,1),c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1),c(0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1),c(0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1))
y1 <- c(0,0,0,1,1,1,2,2,0,1,0,1,1,2,1,2)

testA  <- glmnetcr(x=x1,y=y1,method = "forward", nlambda=10,lambda.min.ratio=0.001, alpha =1,maxit = 500,standardize=FALSE)

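Running this with the original lognet() code causes no problems. Running it with the modified lognet() code produces strange lambda estimates and/or a segfault (seemingly at random). My first guess was that I had specified one of the variables incorrectly, but I have gone through all of them twice and cannot see the problem. The other possibility is that the underlying Fortran code cannot handle 64-bit integers. I know zero Fortran, so even if that is the case, I would not know where to start fixing it.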

1 Answer:

Answer 0: (score: 2)

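So, I contacted the package maintainers of glmnet. They had previous experience with converting to .C64. With their help and a little fiddling I was able to get the following code working. To do this, I created a new function called glmnet64, which calls another new function, lognet64, in place of the original lognet call. lognet64 is identical to the original lognet function, except that the .Fortran call is replaced with the following: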

.C64("lognet", SIGNATURE =   c("double", "integer","integer","integer","double",
                               "double", "double", "integer","double", "double",
                               "integer","integer","integer","double", "double",
                               "double", "integer","integer","integer","integer",
                               "integer","double", "double", "integer","integer",
                               "double", "double", "double","integer","integer"),
          parm = alpha,nobs,           nvars,            nc,      as.double(x), 
          y,           offset,         jd,                  vp,      cl, 
          ne,          nx,             nlam,                flmin,   ulam, 
          thresh,      isd,            intr,                maxit,   kopt, 
          lmu = integer(1),    a0 = double(nlam * nc), 
          ca = double(nx * nlam * nc), ia = integer(nx), nin = integer(nlam), 
          nulldev = double(1), dev = double(nlam),     alm = double(nlam),          
          nlp = integer(1), jerr = integer(1), 
          INTENT = c(rep("rw",4),"r",rep("rw",15),rep("w",10)),     
          PACKAGE = "glmnet",
          NAOK = TRUE)
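As a quick check on what actually changed, the sketch below compares the SIGNATURE vector from the question with the one in this answer, using base R only. Treating "int" as another spelling of "integer" is an assumption made purely for the comparison; the dotCall64 documentation is the place to confirm which type strings are accepted.

```r
# SIGNATURE as posted in the question.
sig_question <- c("double", "int",    "int",    "int",    "int64",
                  "double", "double", "int",    "double", "double",
                  "int",    "int",    "int",    "double", "double",
                  "double", "int",    "int",    "int",    "int",
                  "int",    "double", "double", "int",    "int",
                  "double", "double", "double", "int",    "int")

# SIGNATURE as posted in this answer.
sig_answer <- c("double",  "integer", "integer", "integer", "double",
                "double",  "double",  "integer", "double",  "double",
                "integer", "integer", "integer", "double",  "double",
                "double",  "integer", "integer", "integer", "integer",
                "integer", "double",  "double",  "integer", "integer",
                "double",  "double",  "double",  "integer", "integer")

# Assumption for comparison only: "int" and "integer" mean the same type.
normalise <- function(s) sub("^int$", "integer", s)

# Positions whose declared type differs between the two versions.
differs <- which(normalise(sig_question) != sig_answer)

# INTENT from the answer; it needs one entry per argument.
intent <- c(rep("rw", 4), "r", rep("rw", 15), rep("w", 10))
stopifnot(length(intent) == length(sig_answer))

print(differs)
print(rbind(question = sig_question[differs], answer = sig_answer[differs]))
```

Under that assumption the only difference is the fifth argument, as.double(x), which the question declared as "int64" and this answer declares as "double". A mismatch of that kind would plausibly account for the garbage lambda values and segfaults, although the thread does not state this explicitly.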

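The key seemed to be getting all of the variable types specified correctly. I was able to use browser() just before the .Fortran call to find the correct ones. Also, specifying INTENT and setting NAOK = TRUE improved the speed (as expected). I would definitely recommend both.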
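One way to do the type check without reading each value by hand is to ask R for the storage type of every argument. The sketch below is not part of glmnet or dotCall64: signature_from_args is a hypothetical helper, and the variables are toy stand-ins for what one would inspect inside browser(). It only distinguishes "double" from "integer"; an argument that really needs 64-bit integers would have to be declared by hand.

```r
# Hypothetical helper (not from glmnet or dotCall64): build a type vector
# from the storage types of the values that will be passed to the call.
signature_from_args <- function(...) {
  args  <- list(...)
  types <- vapply(args, typeof, character(1))
  ok    <- types %in% c("double", "integer")
  if (!all(ok)) {
    stop("unsupported storage type at position(s): ",
         paste(which(!ok), collapse = ", "))
  }
  unname(types)
}

# Toy stand-ins for a few of the lognet() arguments (illustrative values).
alpha <- 1                       # double
nobs  <- 16L                     # integer
nvars <- 7L                      # integer
x     <- matrix(0, nobs, nvars)  # double matrix
lmu   <- integer(1)              # integer output slot

sig <- signature_from_args(alpha, nobs, nvars, as.double(x), lmu)
print(sig)
```

Comparing a vector built this way against a hand-written SIGNATURE makes a mistyped entry easy to spot before the call reaches the compiled code.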