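I built a logistic regression model using the glm function from the stats package. I now want to predict the outcome of this model on a large number of values stored in an "ffdf" object (see the ff package), but I cannot work out how to proceed:
How do I subset my ffdf object so that it keeps only the variables (i.e. columns) used in my prediction? This is needed as the input to the predict function.
What should the next step be? Which function should I use among predict(), predict.glm() and predict.bigglm() (perhaps the biglm package is useful)?
Thanks in advance for your views on this!
Best regards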
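Update
Thanks for your feedback, BondedDust. Let me be more precise: this is indeed a coding question. The aim is to fit a logistic regression on an ffdf object (the training dataset) and to predict the model outcome for another ffdf object (the test dataset).
(1/3) Training dataset: an ffdf object (created with the ff package).
`class(train.random.sample)`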
[1] "ffdf"
Here is the structure of the ffdf object, in case it is needed:
`str(train.random.sample)`
List of 3
$ virtual: 'data.frame': 27 obs. of 7 variables:
.. $ VirtualVmode : chr "integer" "integer" "integer" "integer" ...
.. $ AsIs : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ VirtualIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalElementNo: int 1 2 3 4 5 6 7 8 9 10 ...
.. $ PhysicalFirstCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. $ PhysicalLastCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. - attr(*, "Dim")= int 500000 27
.. - attr(*, "Dimorder")= int 1 2
$ physical: List of 27
.. $ id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ click : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ hour : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ C1 : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ banner_pos : list()
.. ..- attr(*, "physical")=Class 'ff_pointer' <externalptr>
.. .. ..- attr(*, "vmode")= chr "integer"
.. .. ..- attr(*, "maxlength")= int 500000
.. .. ..- attr(*, "pattern")= chr "ffdf"
.. .. ..- attr(*, "filename")= chr "anonymized.ff"
.. .. ..- attr(*, "pagesize")= int 65536
.. .. ..- attr(*, "finalizer")= chr "delete"
.. .. ..- attr(*, "finonexit")= logi TRUE
.. .. ..- attr(*, "readonly")= logi FALSE
.. .. ..- attr(*, "caching")= chr "mmnoflush"
.. ..- attr(*, "virtual")= list()
.. .. ..- attr(*, "Length")= int 500000
.. .. ..- attr(*, "Symmetric")= logi FALSE
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_domain : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ site_category : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_domain : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ app_category : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_id : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_ip : list()
….
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_os : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_make : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_model : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_type : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_conn_type : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ device_geo_country: list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
.. $ C17 : list()
…
.. .. - attr(*, "class") = chr [1:2] "ff_vector" "ff"
$ row.names: NULL
- attributes: List of 2
.. $ names: chr [1:3] "virtual" "physical" "row.names"
.. $ class: chr "ffdf"
(2/3) Logistic regression fitted on the training dataset:
The goal is to learn/predict the 'click' outcome from the 'banner_pos' input.
`logreg1 <- glm(click ~ banner_pos, data = train.random.sample, family = "binomial")
summary(logreg1)`
Call:
glm(formula = click ~ banner_pos, family = "binomial", data = train.random.sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0555 -0.6495 -0.5951 -0.5951 1.9071
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.641416 0.004702 -349.12 <2e-16 ***
banner_pos 0.192534 0.007595 25.35 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 458848 on 499999 degrees of freedom
Residual deviance: 458215 on 499998 degrees of freedom
AIC: 458219
Number of Fisher Scoring iterations: 4
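`class(logreg1)`
[1] "glm" "lm"
(3/3) Test dataset: an ffdf object (created with the ff package).
`class(df.test)`
[1] "ffdf"
The test dataset has the same structure as the training dataset, with about 4.8 million rows.
`str(df.test)`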
List of 3
$ virtual: 'data.frame': 26 obs. of 7 variables:
.. $ VirtualVmode : chr "integer" "integer" "integer" "integer" ...
.. $ AsIs : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ VirtualIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalIsMatrix : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
.. $ PhysicalElementNo: int 1 2 3 4 5 6 7 8 9 10 ...
.. $ PhysicalFirstCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. $ PhysicalLastCol : int 1 1 1 1 1 1 1 1 1 1 ...
.. - attr(*, "Dim")= int 4769401 26
.. - attr(*, "Dimorder")= int 1 2
$ physical: List of 26
…
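I cannot manage to predict the click outcome. I first tried to create a data frame or ffdf object containing only the banner_pos variable:
`modeldata <- df.test[["banner_pos"]]`
Then I tried to predict the outcome:
`predict.glm(object = logreg1, newdata = modeldata, type = "response")`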
Error in as.data.frame.default(data) :
cannot coerce class "c("ff_vector", "ff")" to a data.frame
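Is there anything wrong in my code? Should I use other functions, for example from other packages such as biglm? Many thanks for your views on this issue. Best regards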
Answer 0 (score: 0)
Something like the following will score an ffdf with a glm.
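library(ff)
# Add an on-disk column of NAs to hold the predicted probabilities.
df.test$score <- ff(as.numeric(NA), length = nrow(df.test))
# Split the rows into ranges that each fit in RAM.
chunks <- chunk(df.test)
for (chunkrangeindex in chunks) {
  # df.test[chunkrangeindex, ] loads the chunk as an ordinary data.frame, which predict() accepts.
  df.test$score[chunkrangeindex] <- predict(object = logreg1, newdata = df.test[chunkrangeindex, ], type = "response")
}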
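For anyone who wants to try the chunked approach without the original data, here is a minimal, self-contained sketch on synthetic data. The names `train_df`, `test_df`, `test_ffdf`, `fit` and `score` are made up for illustration, and the simulated coefficients are only loosely modelled on the output in the question. It also shows the likely cause of the error in the question: `[[` on an ffdf returns a single ff vector rather than a data frame, whereas row/column indexing with `drop = FALSE` returns an ordinary in-RAM data.frame that `predict()` can use.

```r
library(ff)

set.seed(42)

# Synthetic stand-in for the training data (hypothetical, not the asker's data).
train_df <- data.frame(banner_pos = sample(0:1, 2000, replace = TRUE))
train_df$click <- rbinom(2000, 1, plogis(-1.6 + 0.2 * train_df$banner_pos))
fit <- glm(click ~ banner_pos, data = train_df, family = "binomial")

# Synthetic test set, stored on disk as an ffdf.
test_df <- data.frame(id = 1:5000,
                      banner_pos = sample(0:1, 5000, replace = TRUE))
test_ffdf <- as.ffdf(test_df)

# `[[` extracts one physical column as an ff vector; this is what
# predict.glm() failed to coerce to a data.frame in the question.
col_class <- class(test_ffdf[["banner_pos"]])

# On-disk result vector, filled one chunk at a time.
score <- ff(vmode = "double", length = nrow(test_ffdf))
for (idx in chunk(test_ffdf)) {
  # Row range plus column name with drop = FALSE gives an in-RAM data.frame
  # holding only the predictor the model needs.
  block <- test_ffdf[idx, "banner_pos", drop = FALSE]
  score[idx] <- predict(fit, newdata = block, type = "response")
}
```

On data this small `chunk()` may well return a single range; on a multi-million-row ffdf the same loop would run once per range. Keeping `score` as a separate ff vector, rather than a new column of the ffdf, is just a choice made for this sketch.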