Question

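I built a logistic regression model with the glm function from the stats package. I now want to predict the outcome of this model on a large number of values stored in an "ffdf" object (see the ff package), but I cannot work out how to proceed:

How do I create a subset of my ffdf object that keeps only the variables (i.e. columns) used in my prediction? This subset has to be passed as input to the predict function.
What is the next step? Which function should I use: predict(), predict.glm(), or predict.bigglm() (perhaps the biglm package is useful here)?

Thanks in advance for your input!

Best regards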

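Update

Thanks for the feedback, BondedDust. To be more precise: this is indeed a coding question. The aim is to fit a logistic regression on an ffdf object (the training data set) and to predict the model's outcome on another ffdf object (the test data set).

(1/3) Training data set: an ffdf object (created with the ff package).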

` class(train.random.sample)` >   
[1] "ffdf"

Here is the structure of the ffdf object, in case it is needed:

`str(train.random.sample) ` >

List of 3   
 $ virtual: 'data.frame':   27 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
 .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
 .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. - attr(*, "Dim")= int  500000 27   
 .. - attr(*, "Dimorder")= int  1 2   
 $ physical: List of 27   
 .. $ id                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ click             : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ hour              : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C1                : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ banner_pos        : list()   
 ..  ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>    
 ..  .. ..- attr(*, "vmode")= chr "integer"   
 ..  .. ..- attr(*, "maxlength")= int 500000   
 ..  .. ..- attr(*, "pattern")= chr "ffdf"   
 ..  .. ..- attr(*, "filename")= chr "anonymized.ff"   
 ..  .. ..- attr(*, "pagesize")= int 65536   
 ..  .. ..- attr(*, "finalizer")= chr "delete"   
 ..  .. ..- attr(*, "finonexit")= logi TRUE   
 ..  .. ..- attr(*, "readonly")= logi FALSE   
 ..  .. ..- attr(*, "caching")= chr "mmnoflush"   
 ..  ..- attr(*, "virtual")= list()   
 ..  .. ..- attr(*, "Length")= int 500000   
 ..  .. ..- attr(*, "Symmetric")= logi FALSE    
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_id           : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_domain       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ site_category     : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_id            : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_domain        : list()   
…  
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ app_category      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_id         : list()   
 …   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_ip         : list()   
….   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_os         : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_make       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_model      : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_type       : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_conn_type  : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ device_geo_country: list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
 .. $ C17               : list()   
…   
 .. .. - attr(*, "class") =  chr [1:2] "ff_vector" "ff"   
$ row.names:  NULL   
- attributes: List of 2   
 .. $ names: chr [1:3] "virtual" "physical" "row.names"   
 .. $ class: chr "ffdf"

(2/3) Logistic regression:

The goal is to learn/predict the 'click' outcome from the 'banner_pos' input.

`logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")   
summary(logreg1)` >   


Call:
glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0555  -0.6495  -0.5951  -0.5951   1.9071  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.641416   0.004702 -349.12   <2e-16 ***
banner_pos   0.192534   0.007595   25.35   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 458848  on 499999  degrees of freedom
Residual deviance: 458215  on 499998  degrees of freedom
AIC: 458219

Number of Fisher Scoring iterations: 4

`class(logreg1)`>
[1] "glm" "lm"

(3/3) Test data set: an ffdf object (created with the ff package).

`class(df.test)` >   
[1] "ffdf"

The test data set has the same structure as the training data set, with about 4.8 million rows.

`str(df.test)`>   

List of 3   
 $ virtual: 'data.frame':   26 obs. of  7 variables:   
 .. $ VirtualVmode     : chr  "integer" "integer" "integer" "integer" ...   
 .. $ AsIs             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ VirtualIsMatrix  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalIsMatrix : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...   
 .. $ PhysicalElementNo: int  1 2 3 4 5 6 7 8 9 10 ...   
 .. $ PhysicalFirstCol : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. $ PhysicalLastCol  : int  1 1 1 1 1 1 1 1 1 1 ...   
 .. - attr(*, "Dim")= int  4769401 26   
 .. - attr(*, "Dimorder")= int  1 2   
 $ physical: List of 26   
…

I cannot get the click outcome predicted. I first tried to create a data frame or ffdf object containing only the banner_pos variable:

`modeldata <- df.test[["banner_pos"]]`

Then I tried to predict the outcome:

`predict.glm(object = logreg1, newdata = modeldata, type = "response")`

Error in as.data.frame.default(data) : 
  cannot coerce class "c("ff_vector", "ff")" to a data.frame

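Is something wrong in my code? Should I use other functions, for example from another package such as biglm? Many thanks for your input on this. Best regards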

Answer 1

Something like this will score an ffdf with a glm model.

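library(ff)
# Pre-allocate an on-disk column to hold the predicted probabilities
df.test$score <- ff(as.numeric(NA), length = nrow(df.test))
# Split the row range into chunks that fit in RAM
chunks <- chunk(df.test)
for (chunkrangeindex in chunks) {
  # df.test[rows, ] loads the chunk as an ordinary data.frame, which predict() accepts
  df.test$score[chunkrangeindex] <- predict(object = logreg1, newdata = df.test[chunkrangeindex, ], type = "response")
}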
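
On the first part of the question (keeping only the columns the model uses): the error in the question comes from `df.test[["banner_pos"]]`, which returns a single ff vector and not a data frame. Indexing an ffdf with rows and columns, as in `df.test[rows, cols, drop = FALSE]`, reads that block into RAM as a regular data.frame instead. Below is a minimal sketch of the same chunked loop restricted to the predictor columns. The simulated `train` and `test_df` data, and the names `vars`, `score` and `block`, are made up for illustration and only stand in for the asker's real objects.

```r
library(ff)

# Hypothetical small stand-ins for the asker's data (not the real files).
set.seed(1)
train <- data.frame(banner_pos = sample(0:1, 1000, replace = TRUE))
train$click <- rbinom(1000, 1, plogis(-1.6 + 0.2 * train$banner_pos))
logreg1 <- glm(click ~ banner_pos, data = train, family = "binomial")

test_df <- data.frame(id = 1:5000,
                      banner_pos = sample(0:1, 5000, replace = TRUE))
df.test <- as.ffdf(test_df)

# Names of the predictors the fitted model needs (here just "banner_pos").
vars <- all.vars(delete.response(terms(logreg1)))

# Pre-allocate an on-disk vector for the predicted probabilities.
score <- ff(NA_real_, length = nrow(df.test))

for (idx in chunk(df.test)) {
  # [rows, cols] on an ffdf returns an in-memory data.frame;
  # drop = FALSE keeps it a data.frame even when only one column is selected.
  block <- df.test[idx, vars, drop = FALSE]
  score[idx] <- predict(logreg1, newdata = block, type = "response")
}
```

Reading only the needed columns keeps each chunk small, which matters when the test set has millions of rows and many unused factor-like columns. Plain `predict()` is enough here because it dispatches to `predict.glm` for a glm fit.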
