R: logistic regression on ffdf objects

Time: 2014-11-10 22:29:05

Tags: r regression ff

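I built a logistic regression model with the glm function from the stats package. I now want to predict the outcome of this model on a large number of values stored in an "ffdf" object (see the ff package), but I cannot work out how to proceed: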

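  1. How do I subset my ffdf object so that it keeps only the variables (i.e. columns) used in my prediction, as the predict function requires for its input?

  2. What should I do next? Which function should I use among predict(), predict.glm() and predict.bigglm() (perhaps the biglm package is useful)?

Thanks in advance for your views on this!

Best regards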

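Update

Thanks for your feedback, BondedDust. Let me be more precise: this is indeed a coding question. The aim is to fit a logistic regression on one ffdf object (the training dataset) and to predict the model's outcome for another ffdf object (the test dataset).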

(1/3) Training dataset: an ffdf object (created with the ff package).

    > class(train.random.sample)
    [1] "ffdf"
    

Here is the structure of the ffdf object, in case it is needed:

    > str(train.random.sample)
    
    List of 3   
     $ virtual: 'data.frame':   27 obs. of  7 variables:   
     .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
     .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
     .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
     .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
     .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
     .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
     .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
     .. - attr(*, "Dim")= int  500000 27   
     .. - attr(*, "Dimorder")= int  1 2   
     $ physical: List of 27   
     .. $ id                : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ click             : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ hour              : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ C1                : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ banner_pos        : list()   
     ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>    
     ..  .. ..- attr(*, "vmode")= chr "integer"   
     ..  .. ..- attr(*, "maxlength")= int 500000   
     ..  .. ..- attr(*, "pattern")= chr "ffdf"   
     ..  .. ..- attr(*, "filename")= chr "anonymized.ff"   
     ..  .. ..- attr(*, "pagesize")= int 65536   
     ..  .. ..- attr(*, "finalizer")= chr "delete"   
     ..  .. ..- attr(*, "finonexit")= logi TRUE   
     ..  .. ..- attr(*, "readonly")= logi FALSE   
     ..  .. ..- attr(*, "caching")= chr "mmnoflush"   
     ..  ..- attr(*, "virtual")= list()   
     ..  .. ..- attr(*, "Length")= int 500000   
     ..  .. ..- attr(*, "Symmetric")= logi FALSE    
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ site_id           : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ site_domain       : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ site_category     : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ app_id            : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ app_domain        : list()   
    …  
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ app_category      : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_id         : list()   
     …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_ip         : list()   
    ….   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_os         : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_make       : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_model      : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_type       : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_conn_type  : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ device_geo_country: list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
     .. $ C17               : list()   
    …   
     .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
    $ row.names:  NULL   
    - attributes: List of 2   
     .. $ names: chr [1:3] "virtual" "physical" "row.names"   
     .. $ class: chr "ffdf"   
    
(2/3) Logistic regression fitted on the training dataset.

The goal is to learn/predict the `click` outcome from the `banner_pos` input:

    > logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")
    > summary(logreg1)
    
    
    Call:
    glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.0555  -0.6495  -0.5951  -0.5951   1.9071  
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -1.641416   0.004702 -349.12   <2e-16 ***
    banner_pos   0.192534   0.007595   25.35   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 458848  on 499999  degrees of freedom
    Residual deviance: 458215  on 499998  degrees of freedom
    AIC: 458219
    
    Number of Fisher Scoring iterations: 4
    
    > class(logreg1)
    [1] "glm" "lm" 
    

(3/3) Test dataset: an ffdf object (created with the ff package).

    > class(df.test)
    [1] "ffdf"
    

The test dataset has the same structure as the training dataset, with about 4.8 million rows:

    > str(df.test)
    
    List of 3   
     $ virtual: 'data.frame':   26 obs. of  7 variables:   
     .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
     .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
     .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
     .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
     .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...
     .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...
     .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...
     .. - attr(*, "Dim")= int  4769401 26
     .. - attr(*, "Dimorder")= int  1 2
     $ physical: List of 26
    …   
    

I cannot get the click prediction to work. I first tried to create a data frame or ffdf object containing only the banner_pos variable:

    > modeldata <- df.test[["banner_pos"]]
    

Then I tried to predict the outcome:

    > predict.glm(object = logreg1, newdata = modeldata, type = "response")
    
    Error in as.data.frame.default(data) : 
      cannot coerce class "c("ff_vector", "ff")" to a data.frame
    

Is something wrong with my code? Should I use other functions, or other packages such as biglm? Many thanks for your views on this.

Best regards

1 answer:

Answer 0 (score: 0)

Something like this will score an ffdf with a glm, one chunk of rows at a time:

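    library(ff)

    # Pre-allocate an on-disk column for the predicted probabilities.
    df.test$score <- ff(as.numeric(NA), length = nrow(df.test))

    # chunk() splits the rows into index ranges small enough to hold in RAM.
    chunks <- chunk(df.test)
    for (chunkrangeindex in chunks) {
      # Indexing an ffdf by rows returns an ordinary data.frame, which predict() accepts.
      df.test$score[chunkrangeindex] <- predict(object = logreg1, newdata = df.test[chunkrangeindex, ], type = "response")
    }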
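
To see the chunking idea in isolation, here is a minimal sketch that uses only base R on simulated data, so it runs without ff installed. The data, the names `train`, `test`, `needed` and `chunk_size`, and the simulated coefficients are all made up for illustration; they are not the asker's data. It also touches on the first question: each block keeps only the columns the model's formula actually uses. With an ffdf, the row ranges would come from `chunk()` as above rather than from `seq()`.

```r
# Simulated stand-ins for the question's data (hypothetical values).
set.seed(1)
n <- 10000
train <- data.frame(banner_pos = sample(0:1, n, replace = TRUE))
train$click <- rbinom(n, 1, plogis(-1.6 + 0.2 * train$banner_pos))
test <- data.frame(banner_pos  = sample(0:1, 2500, replace = TRUE),
                   device_type = sample(0:3, 2500, replace = TRUE))

fit <- glm(click ~ banner_pos, data = train, family = binomial())

# Names of the predictors used by the model (the response is dropped).
needed <- all.vars(delete.response(terms(fit)))

# Score the test rows in blocks of at most chunk_size rows.
chunk_size <- 1000
score <- rep(NA_real_, nrow(test))
for (s in seq(1, nrow(test), by = chunk_size)) {
  idx <- s:min(s + chunk_size - 1, nrow(test))
  # drop = FALSE keeps a one-column data.frame instead of a bare vector,
  # which is what predict() needs as newdata.
  block <- test[idx, needed, drop = FALSE]
  score[idx] <- predict(fit, newdata = block, type = "response")
}

# Chunked scoring should agree with scoring everything in one call.
stopifnot(isTRUE(all.equal(score,
                           unname(predict(fit, newdata = test, type = "response")))))
```

The `drop = FALSE` detail is the base R analogue of the error in the question: `df.test[["banner_pos"]]` extracts a single ff vector, whereas `predict()` expects something it can treat as a data frame.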