I am fitting a logistic regression model with glm(). The problem is with the last two variables: change_per_call and change_per_min.
The structure of my data is:
<pre>
'data.frame': 66115 obs. of 54 variables:
$ Customer_ID : num 1030539 1025498 1053921 1005718 1046164 ...
$ actvsubs : num 1 1 1 2 1 1 1 2 2 1 ...
$ adjrev : num 1017 1784 821 1398 404 ...
$ adjmou : num 5690 9367 5161 7294 1975 ...
$ avgmou : num 284 468 469 203 152 ...
$ avgrev : num 50.8 89.2 74.6 38.9 31.1 ...
$ avgqty : num 92 277.2 116.8 59.5 65.8 ...
$ age1 : num 52 60 30 56 0 20 52 34 32 0 ...
$ age2 : num 34 66 28 48 0 0 0 0 0 0 ...
$ blck_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ callwait_Mean : num 0 4 0.667 0 0.333 ...
$ callwait_Range : num 0 2 2 0 1 0 1 0 0 2 ...
$ change_mou : num -97.5 129 339.8 86 -63.2 ...
$ children : num 1 1 0 1 0 0 0 0 0 0 ...
$ comp_vce_Mean : num 43.3 304.3 154.7 7 51 ...
$ custcare_Mean : num 0 0 1 0 0 ...
$ datovr_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ da_Mean : num 0.743 0 0.495 0 0 ...
$ drop_blk_Mean : num 5.667 26.333 6.333 0.333 3.333 ...
$ drop_dat_Mean : num 0 0 0 0 0 0 0 0 0 0 ...
$ drop_vce_Mean : num 2 21 4.333 0.333 2 ...
$ eqpdays : num 633 178 97 748 427 474 279 332 362 885 ...
$ forgntvl : num 0 0 0 0 0 0 0 0 0 0 ...
$ income : num 6 6 8 9 5.79 ...
$ months : num 21 22 12 38 14 16 9 23 12 30 ...
$ mou_Mean : num 166 590 695 76 125 ...
$ ovrmou_Mean : num 10 253 91 0 0 ...
$ ovrrev_Mean : num 3.5 88.6 27.3 0 0 ...
$ retdays : num 235 235 235 235 235 ...
$ rev_Range : num 14.99 75.95 110.19 0 0.54 ...
$ roam_Mean : num 0.292 0 0 0 0 ...
$ totcalls : num 1843 5586 1285 2181 865 ...
$ totrev : num 1049 1835 911 1449 464 ...
$ wrkwoman : num 0 1 0 1 0 0 0 0 0 0 ...
$ asl_flag : num 0 0 1 0 0 0 0 0 0 0 ...
$ dwlltype : num 0 0 0 0 0 1 1 0 0 0 ...
$ refurb_new : num 1 0 1 1 1 1 1 1 1 1 ...
$ mtrcycle : num 0 0 0 0 0 0 0 0 0 0 ...
$ truck : num 0 1 0 0 0 0 0 0 0 0 ...
$ hnd_price : num 150 30 130 150 150 ...
$ models : num 1 2 2 2 1 1 1 2 1 1 ...
$ numbcars : num 2 1 1 2 1.57 ...
$ churn : num 1 1 1 1 1 1 1 1 1 1 ...
$ urban : num 1 0 0 0 0 0 1 0 0 1 ...
$ city : num 0 1 0 0 0 0 0 0 0 0 ...
$ suburban : num 0 0 1 1 0 1 0 0 1 0 ...
$ town : num 0 0 0 0 1 0 0 0 0 0 ...
$ married : num 1 1 0 1 0 0 0 1 0 0 ...
$ married_A : num 0 0 1 0 0 0 0 0 0 0 ...
$ married_B : num 0 0 0 0 0 0 0 0 0 1 ...
$ married_S : num 0 0 0 0 0 1 1 0 0 0 ...
$ webcap : num 1 1 1 1 1 1 1 1 1 0 ...
<strong>$ change_per_call: num 0.569 0.328 0.709 0.664 0.537 ...
$ change_per_min : num 0.179 0.19 0.159 0.192 0.205 ...</strong>
</pre>
When I fit the model without the change_per_call and change_per_min variables, I get the following output:
model5 <- glm(formula =churn~ actvsubs+ adjrev+
#adjmou+
avgmou+
#avgrev+
#avgqty+
age1+
#age2+
#blck_dat_Mean+
#callwait_Mean+
#callwait_Range+
change_mou+ children+comp_vce_Mean+custcare_Mean+
#datovr_Mean+
#da_Mean+
#drop_blk_Mean+
#drop_dat_Mean+
drop_vce_Mean+eqpdays+
#forgntvl+
#income+
months+mou_Mean+
#ovrmou_Mean+
ovrrev_Mean+
retdays+rev_Range+roam_Mean+
#totcalls+
#totrev+
wrkwoman+asl_flag+dwlltype+refurb_new+
#mtrcycle+truck+
hnd_price+models
#+numbcars
,data = train[,-1], family = "binomial")
summary(model5)
Call:
glm(formula = churn ~ actvsubs + adjrev + avgmou + age1 + change_mou +
children + comp_vce_Mean + custcare_Mean + drop_vce_Mean +
eqpdays + months + mou_Mean + ovrrev_Mean + retdays + rev_Range +
roam_Mean + wrkwoman + asl_flag + dwlltype + refurb_new +
hnd_price + models, family = "binomial", data = train[, -1])
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3444 -0.7723 -0.6684 -0.4529 2.6948
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.420e-01 9.426e-02 -5.750 8.91e-09 ***
actvsubs 3.840e-02 1.769e-02 2.170 0.029970 *
adjrev -7.662e-05 2.339e-05 -3.276 0.001053 **
avgmou 7.974e-04 6.461e-05 12.341 < 2e-16 ***
age1 -6.242e-03 5.647e-04 -11.053 < 2e-16 ***
change_mou -1.625e-04 4.536e-05 -3.583 0.000339 ***
children 7.449e-02 2.816e-02 2.645 0.008171 **
comp_vce_Mean -8.242e-04 1.943e-04 -4.242 2.21e-05 ***
custcare_Mean -7.873e-03 3.122e-03 -2.522 0.011669 *
drop_vce_Mean 9.510e-03 1.618e-03 5.878 4.14e-09 ***
eqpdays 1.068e-03 7.618e-05 14.017 < 2e-16 ***
months -1.189e-02 2.308e-03 -5.153 2.57e-07 ***
mou_Mean -7.885e-04 6.308e-05 -12.502 < 2e-16 ***
ovrrev_Mean 5.766e-03 6.177e-04 9.335 < 2e-16 ***
retdays -1.884e-03 3.047e-04 -6.184 6.24e-10 ***
rev_Range 3.309e-04 2.382e-04 1.389 0.164751
roam_Mean 3.713e-03 1.160e-03 3.201 0.001370 **
wrkwoman -5.017e-02 3.523e-02 -1.424 0.154425
asl_flag -4.570e-01 3.801e-02 -12.024 < 2e-16 ***
dwlltype 7.346e-02 2.841e-02 2.586 0.009715 **
refurb_new -1.958e-01 3.632e-02 -5.390 7.03e-08 ***
hnd_price -1.519e-03 2.359e-04 -6.436 1.23e-10 ***
models 1.059e-01 2.036e-02 5.201 1.98e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 50785 on 46279 degrees of freedom
Residual deviance: 49410 on 46257 degrees of freedom
AIC: 49456
If I add the variables change_per_call and change_per_min, I get an error message:
model6 <- glm(formula =churn~ actvsubs+ adjrev+
#adjmou+
avgmou+
#avgrev+
#avgqty+
age1+
#age2+
#blck_dat_Mean+
#callwait_Mean+
#callwait_Range+
change_mou+ children+comp_vce_Mean+custcare_Mean+
#datovr_Mean+
#da_Mean+
#drop_blk_Mean+
#drop_dat_Mean+
drop_vce_Mean+eqpdays+
#forgntvl+
#income+
months+mou_Mean+
#ovrmou_Mean+
ovrrev_Mean+
retdays+rev_Range+roam_Mean+
#totcalls+
#totrev+
wrkwoman+asl_flag+dwlltype+refurb_new+
#mtrcycle+truck+
hnd_price+models
#+numbcars
+change_per_call+ change_per_min,data = train[,-1],
family = "binomial")
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, :
NA/NaN/Inf in 'x'
Note: the variables change_per_call and change_per_min are both numeric, and they do not contain any zero values.
How can I overcome this kind of error?
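A quick way to see which columns could trigger `NA/NaN/Inf in 'x'` is to count non-finite entries per column with `is.finite()`, which is FALSE for NA, NaN, Inf and -Inf alike. The sketch below uses a small made-up data frame (`train_demo`) as a stand-in for `train`, and `count_nonfinite` is a hypothetical helper name. With R's default `na.action` (`na.omit`), rows containing NA or NaN are normally dropped before fitting, so an Inf or -Inf value is the more likely culprit, though that depends on the session's settings.

```r
# Made-up stand-in for `train`; the values are illustrative only.
train_demo <- data.frame(
  churn           = c(1, 0, 1, 0, 1, 0),
  change_per_call = c(0.569, Inf, 0.709, 0.664, NaN, 0.537),
  change_per_min  = c(0.179, 0.190, -Inf, 0.192, 0.205, NA)
)

# Hypothetical helper: number of non-finite values (NA, NaN, Inf, -Inf)
# in each column of a data frame of numeric columns.
count_nonfinite <- function(df) {
  vapply(df, function(col) sum(!is.finite(col)), integer(1))
}

bad_counts <- count_nonfinite(train_demo)
print(bad_counts)

# Row indices of the offending values in one column
bad_rows <- which(!is.finite(train_demo$change_per_call))
print(bad_rows)
```

Running the same helper on the data frame that is actually passed to glm() (here `train[, -1]`) should show whether the two ratio variables still contain such values.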
After removing the infinite values from the two variables change_per_call and change_per_min, I got the following summary results:
> summary(tele6$change_per_call)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0353 0.2914 0.4377 0.8324 0.7084 366.9400
> summary(tele6$change_per_min)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0100 0.0974 0.1409 0.3300 0.2276 622.3750
@Ben Bolker, I now have no zeros and no infinite values, but glm.fit still shows the error.
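One thing that may be worth checking: the summaries above are for `tele6`, whereas the model is fitted on `train`, so the cleaning may not have reached the data glm() actually sees. A minimal sketch of filtering the modelling data frame itself and then refitting is shown below. The data are simulated, and `demo`, `model_vars` and `demo_clean` are hypothetical names, not objects from the question.

```r
set.seed(1)
n <- 200

# Simulated stand-in for `train`; not the real data.
demo <- data.frame(
  churn           = rbinom(n, 1, 0.3),
  change_per_call = rexp(n),
  change_per_min  = rexp(n)
)
demo$change_per_call[c(5, 50)] <- Inf   # e.g. a ratio with a zero denominator
demo$change_per_min[10] <- NaN          # e.g. 0 / 0

# Fitting on the unfiltered data is expected to fail; capture the error
# instead of stopping the script.
raw_attempt <- tryCatch(
  glm(churn ~ change_per_call + change_per_min,
      data = demo, family = "binomial"),
  error = function(e) e
)

# Keep only rows where every model variable is finite.
model_vars <- c("churn", "change_per_call", "change_per_min")
keep <- Reduce(`&`, lapply(demo[model_vars], is.finite))
demo_clean <- demo[keep, ]

fit <- glm(churn ~ change_per_call + change_per_min,
           data = demo_clean, family = "binomial")
print(nrow(demo_clean))
print(coef(fit))
```

The key point of the sketch is that the filter is applied to the same object that is passed as `data =`, rather than to a separate copy.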
Answer 0 (score: 0)
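Since most of your data is numeric, you could try standardizing it with scale(), or you could use lda() from the MASS package instead. LDA and logistic regression often, but not always, give similar results.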
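As a rough illustration of the scale() suggestion, the sketch below standardizes the predictors of a simulated data frame before fitting. The column names are borrowed from the question, but the values are made up and `demo_scaled` is a hypothetical name. Note that scaling on its own would not remove Inf or NaN entries, so it is best treated as a complement to cleaning the data rather than a fix for the error.

```r
set.seed(42)
n <- 300

# Simulated data; only the column names come from the question.
demo <- data.frame(
  churn     = rbinom(n, 1, 0.3),
  eqpdays   = round(runif(n, 0, 900)),
  hnd_price = sample(c(30, 130, 150), n, replace = TRUE)
)

predictors <- c("eqpdays", "hnd_price")

# scale() centers each column to mean 0 and rescales it to sd 1.
demo_scaled <- demo
demo_scaled[predictors] <- as.data.frame(scale(demo[predictors]))

fit_scaled <- glm(churn ~ eqpdays + hnd_price,
                  data = demo_scaled, family = "binomial")
print(summary(fit_scaled)$coefficients)
```

After scaling, each coefficient is the change in log-odds per one standard deviation of the predictor, which makes very differently scaled variables easier to compare.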
I assume you have already looked at the range() of these two variables to check whether they contain unusually large or small values.
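The range() check can be sketched as follows. The vector `x` reuses the six summary figures reported for change_per_call in the question purely as illustrative inputs. It is not the real column, so the median computed here is not the column's median.

```r
# Illustrative values taken from the summary() output in the question.
x <- c(0.0353, 0.2914, 0.4377, 0.8324, 0.7084, 366.94)

rng <- range(x, na.rm = TRUE)       # smallest and largest observed value
all_finite <- all(is.finite(rng))   # FALSE if an Inf or -Inf is still present

# How far the maximum sits above the median of these illustrative values;
# a very large ratio hints at extreme outliers.
max_to_median <- rng[2] / median(x, na.rm = TRUE)

print(rng)
print(all_finite)
print(max_to_median)
```

A finite range with a maximum several hundred times the median, as in the question's summaries, would not by itself explain the `NA/NaN/Inf in 'x'` error, but such values may be worth inspecting or transforming before modelling.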