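Reducing the size of a glmer model

Time: 2015-07-11 17:31:44

Tags: r lme4

I am new to R. I am using glmer to fit several binomial models, and I only need them so that I can call predict and use the resulting probabilities. However, I have a very large dataset, and even a single model object becomes very large: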

> library(pryr)
> object_size(mod)
701 MB

The size of the model coefficients is tiny by comparison:

> object_size(coef(mod))
1.16 MB

and so is the size of the fitted values:

> object_size(fitted(mod))
25.6 MB

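First, I do not understand why the model object is so large. It appears to contain the original data frame used to fit the model, but even that does not account for the size. Why is it so big?

Second, is it possible to strip the model down to only the parts needed to call predict? If so, how would I do it? I found a post about doing this for glm at http://blog.yhathq.com/posts/reducing-your-r-memory-footprint-by-7000x.html, but glmer models seem to be accessed differently and to have different components.

Any help is much appreciated.

Edit:

Digging into the internals of the model: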

> object_size(getME(mod, "X"))
205 MB
> object_size(getME(mod, "Z"))
36.9 MB
> object_size(getME(mod, "Zt"))
38.4 MB
> object_size(getME(mod, "Ztlist"))
41.6 MB
> object_size(getME(mod, "mmList"))
38.4 MB
> object_size(getME(mod, "y"))
3.2 MB
> object_size(getME(mod, "mu"))
3.2 MB
> object_size(getME(mod, "u"))
18.4 kB
> object_size(getME(mod, "b"))
19.5 kB
> object_size(getME(mod, "Gp"))
56 B
> object_size(getME(mod, "Tp"))
472 B
> object_size(getME(mod, "L"))
15.5 MB
> object_size(getME(mod, "Lambda"))
38.1 kB
> object_size(getME(mod, "Lambdat"))
38.1 kB
> object_size(getME(mod, "Lind"))
9.22 kB
> object_size(getME(mod, "Tlist"))
936 B
> object_size(getME(mod, "A"))
38.4 MB
> object_size(getME(mod, "RX"))
30.3 kB
> object_size(getME(mod, "RZX"))
1.05 MB
> object_size(getME(mod, "sigma"))
48 B
> object_size(getME(mod, "flist"))
4.89 MB
> object_size(getME(mod, "fixef"))
4.5 kB
> object_size(getME(mod, "beta"))
496 B
> object_size(getME(mod, "theta"))
472 B
> object_size(getME(mod, "ST"))
936 B
> object_size(getME(mod, "REML"))
48 B
> object_size(getME(mod, "is_REML"))
48 B
> object_size(getME(mod, "n_rtrms"))
48 B
> object_size(getME(mod, "n_rfacs"))
48 B
> object_size(getME(mod, "N"))
256 B
> object_size(getME(mod, "n"))
256 B
> object_size(getME(mod, "p"))
256 B
> object_size(getME(mod, "q"))
256 B
> object_size(getME(mod, "p_i"))
408 B
> object_size(getME(mod, "l_i"))
408 B
> object_size(getME(mod, "q_i"))
408 B
> object_size(getME(mod, "mod"))
48 B
> object_size(getME(mod, "m_i"))
424 B
> object_size(getME(mod, "m"))
48 B
> object_size(getME(mod, "cnms"))
624 B
> object_size(getME(mod, "devcomp"))
2.21 kB
> object_size(getME(mod, "offset"))
3.2 MB

> get_obj_size(mod@resp, "RC")
                       [,1]
family            673355488
initialize        673355488
initialize#lmResp 673355488
ptr               673355488
resDev            673355488
updateMu          673355488
updateWts         673355488
wrss              673355488
eta                 3196024
mu                  3196024
n                   3196024
offset              3196024
sqrtrwt             3196024
sqrtXwt             3196024
weights             3196024
wtres               3196024
y                   3196024
Ptr                      40
> get_obj_size(mod@pp, "RC")
                   [,1]
beta          449419408
initialize    449419408
initializePtr 449419408
ldL2          449419408
ldRX2         449419408
linPred       449419408
ptr           449419408
setTheta      449419408
sqrL          449419408
u             449419408
X             204549128
V             182171288
Ut             38448168
Zt             38448168
LamtUt         38353248
Xwts            3196024
RZX             1047176
Lambdat           38136
VtV               26192
delu              18408
u0                18408
Utr               18408
Lind               9224
beta0               496
delb                496
Vtr                 496
theta                72
Ptr                  40

2 answers:

Answer 0: (score: 4)

Posting as an incomplete answer for now:

Following along with Steve Walker's S3/S4/Reference class dictionary to list and extract fields:

library("lme4")
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
               data = cbpp, family = binomial)
library("pryr") 
object_size(gm1)  ## 505 kB

get_obj_size <- function(obj,type="S4") {
    fields <- switch(type,
                     S4=slotNames(obj),
                     RC=ls(obj))
    get_field <- switch(type,
                     S4=function(x) slot(obj,x),
                     RC=function(x) obj[[x]])
    field_list <- setNames(lapply(fields,get_field),fields)
    cbind(sort(sapply(field_list,object_size),decreasing=TRUE))
}
get_obj_size(gm1)
##           [,1]
## resp    356620  ## 'response module'
## pp      355420  ## 'predictor module'
## frame     6640
## optinfo   1748
## devcomp   1424
## call      1244
## flist     1232
## cnms       224
## u          152
## beta        56
## Gp          32
## lower       32
## theta       32

It would be worth drilling down further into the response and predictor modules to see what is large, noting that some of the information will be stored in the environment of those components ...

For example, I think the whole set of components that are nominally the same size are in fact not independent, but share the same environment ...

get_obj_size(gm1@resp,"RC")
##                      [,1]
## initialize         356620
## initialize#lmResp  356620
## ptr                356620
## resDev             356620
## setOffset          356620
## updateMu           356620
## updateWts          356620
## wrss               356620
## family              26016
## eta                   472
## mu                    472
## n                     472
## offset                472
## sqrtrwt               472
## sqrtXwt               472
## weights               472
## wtres                 472
## y                     472
## Ptr                    20

Another way to look at the stored components is to take the component names from eval(formals(getME)$name) and iterate object_size(getME(model,component)) over them; this does not correspond precisely to the way the information is stored internally, but it gives you an idea of how much space is needed to hold (e.g.) the fixed-effect or random-effect model matrices.
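The iteration over getME components described above could be sketched roughly as follows. This is an illustration rather than the answerer's code: it uses base R's object.size instead of pryr's object_size to stay self-contained, it fits the small cbpp example model, and it assumes that in recent lme4 releases the list of component names sits on the merMod method of getME (falling back to getME itself for older versions). The helper names getme_fun, comp_names and comp_size are made up for this sketch.

```r
library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# Assumption: newer lme4 keeps the name list on the merMod method;
# older versions keep it on getME itself.
getme_fun <- tryCatch(getS3method("getME", "merMod"),
                      error = function(e) getME)
comp_names <- setdiff(eval(formals(getme_fun)$name), "ALL")

# Size in bytes of one extracted component; NA if it cannot be
# extracted for this kind of model.
comp_size <- function(nm) {
  tryCatch(as.numeric(object.size(getME(gm1, nm))),
           error = function(e) NA_real_)
}

sizes <- sort(vapply(comp_names, comp_size, numeric(1)),
              decreasing = TRUE)
head(sizes, 10)
```

As with the getME figures in the question, these sizes describe the extracted copies, not how the fitted object stores them internally, so they will not add up to the size of the whole model.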

I have worked on this some more and have a partial solution, but there is still a lot of storage that I cannot seem to locate or strip properly. (Note that this requires the most recent version of lme4 on Github: I had to modify the predict function slightly to weaken its dependence on the internal structure.)

glmer_chop <- function(object) {
    newobj <- object
    newobj@frame <- model.frame(object)[0,]
    newobj@pp <- with(object@pp,
                      new("merPredD",
                          Lambdat=Lambdat,
                          Lind=Lind,
                          theta=theta,
                          u=u,u0=u0,
                          n=nrow(X),
                          X=matrix(1,nrow=nrow(X)),
                          Zt=Zt)) ## .sparseDiagonal(n,shape="g")))
    newobj@resp <- new("glmResp",family=binomial(),y=numeric(0))
    return(newobj)
}

data("Contraception", package="mlmRev")  ## the Contraception data come from the mlmRev package
fm1 <- glmer(use ~ urban+age+livch+(1|district), Contraception, binomial)
object_size(Contraception)           ## 133 kB
object_size(fm1)                     ## 1.05 MB
object_size(fm2 <- glmer_chop(fm1))  ## 699 kB
get_obj_size(fm2)                    ## 'pp' is 547200 bytes
get_obj_size(fm2@pp,"RC")            ## 'initialize' object is 547200
saveRDS(fm2,file="tmp.rds")
fm2 <- readRDS("tmp.rds")
object_size(fm2)                     ## 796 kB
rm(fm1)
pp <- predict(fm2,newdata=Contraception)
object_size(fm2)                     ## still 796K; no sharing

Finally, note that get_obj_size(environment(fm2@pp$initialize),"RC") confirms that most of the information here is stored in the environment rather than in the object itself (but I do not know how compare_size(fm2) / compare_size handle Reference Classes ...)
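If the model is only ever needed for predicted probabilities, a different route from trimming the merMod object is to keep just the estimates and rebuild the linear predictor by hand. The sketch below is not the approach taken in this answer. It assumes a logit link and a single random intercept, and the names slim and predict_slim are hypothetical. It uses the cbpp example so that it runs with lme4 alone.

```r
library(lme4)

gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

# Keep only what a random-intercept prediction needs.
re <- ranef(gm1)$herd
fixed_rhs <- formula(gm1, fixed.only = TRUE)[-2]  # drop the response
environment(fixed_rhs) <- globalenv()             # do not drag a large environment along
slim <- list(
  beta = fixef(gm1),                                     # fixed-effect coefficients
  b    = setNames(re[["(Intercept)"]], rownames(re)),    # conditional modes per herd
  form = fixed_rhs
)

# Hypothetical helper: fixed part + random intercept, then inverse logit.
predict_slim <- function(slim, newdata, group = "herd") {
  X   <- model.matrix(slim$form, newdata)
  eta <- drop(X %*% slim$beta[colnames(X)])
  b   <- slim$b[as.character(newdata[[group]])]
  b[is.na(b)] <- 0   # unseen groups get a zero random effect
  plogis(eta + unname(b))
}

p_slim <- predict_slim(slim, cbpp)
p_full <- predict(gm1, newdata = cbpp, type = "response")
all.equal(unname(p_slim), unname(p_full))
object.size(slim)
```

This only generalises with care: new data must carry the same factor levels as the training data so that model.matrix produces matching columns, and random slopes or several grouping factors would need the corresponding terms added to the linear predictor.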

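Answer 1: (score: 0)

Are you concerned about storage space or RAM? If it is about storage, one option is to embed the call that estimates the model in the code that generates the predictions, so that you never actually store the model object. Something like:

## y, x, g and dat are placeholders; glmer needs at least one random-effects term such as (1 | g)
predictions <- predict(glmer(y ~ x + (1 | g), data = dat, family = binomial), type = "response")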
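A slightly more reusable form of the same idea is to fit and predict inside a function, so that the fitted object goes out of scope when the function returns and only the predictions are kept. This is a minimal sketch using the cbpp example data, and fit_and_predict is a hypothetical helper name. It saves storage, but the full model still has to fit in RAM while the function runs.

```r
library(lme4)

# Hypothetical helper: the model exists only inside the function call.
fit_and_predict <- function(formula, data, newdata = data) {
  mod <- glmer(formula, data = data, family = binomial)
  predict(mod, newdata = newdata, type = "response")
}

preds <- fit_and_predict(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp
)
head(preds)
```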