I am trying to understand when logistic regression is useful. I have the following data:
Gender  Age    No.transcation  Transaction
female  18-24  138485          4047
male    18-24  144301          3766
female  25-34  248362          7559
male    25-34  295800          8126
female  35-44  265514          7171
male    35-44  379872          9047
female  45-54  295002          7072
male    45-54  421432          9648
female  55-64  382198          7529
male    55-64  456308          9016
female  65+    352501          4856
male    65+    465253          6889
Running a logistic regression in R, I get the following summary output:

> mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age, data = csvd,
+             family = binomial())
> summary(mod2)

Call:
glm(formula = cbind(Transaction, No.transcation) ~ Gender + Age,
    family = binomial(), data = csvd)

Deviance Residuals:
      1        2        3        4        5        6
 1.8732  -1.9018   2.2654  -2.1473   3.4810  -3.0228
      7        8        9       10       11       12
-0.2772   0.2377  -2.5500   2.3717  -4.9638   4.3408

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -3.562800   0.011984 -297.290  < 2e-16 ***
Gendermale  -0.051852   0.006993   -7.415 1.22e-13 ***
Age25-34     0.044091   0.014042    3.140  0.00169 **
Age35-44    -0.090757   0.013966   -6.499 8.11e-11 ***
Age45-54    -0.164705   0.013894  -11.855  < 2e-16 ***
Age55-64    -0.334841   0.013900  -24.088  < 2e-16 ***
Age65+      -0.651142   0.014767  -44.094  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4490.792  on 11  degrees of freedom
Residual deviance:   93.866  on  5  degrees of freedom
AIC: 235.5

Number of Fisher Scoring iterations: 3
Exponentiating the coefficients gives the odds ratios, and I find that they are almost identical to the plain ratio of transactions to users:

> exp(summary(mod2)$coefficients)
              Estimate Std. Error       z value Pr(>|z|)
(Intercept) 0.02835931   1.012056 7.735499e-130 1.000000
Gendermale  0.94946976   1.007018  6.022806e-04 1.000000
Age25-34    1.04507762   1.014141  2.310243e+01 1.001691
Age35-44    0.91323954   1.014064  1.505641e-03 1.000000
Age45-54    0.84814413   1.013991  7.106341e-06 1.000000
Age55-64    0.71545181   1.013998  3.455562e-11 1.000000
Age65+      0.52145005   1.014877  7.084264e-20 1.000000
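
Only the Estimate column of that output is an odds ratio; exponentiating the whole coefficient matrix also transforms the standard errors, z values and p-values, which have no meaning on that scale. A minimal sketch of a cleaner way to get the odds ratios, assuming `csvd` is rebuilt by hand from the table above (the construction below is my own; the original CSV is not shown) and using Wald intervals from `confint.default` as one possible choice of interval:

```r
# Rebuild the data from the table in the question (assumed layout of csvd).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)

# Same model as in the question: successes first, failures second.
mod2 <- glm(cbind(Transaction, No.transcation) ~ Gender + Age,
            data = csvd, family = binomial())

# Exponentiate only the coefficient estimates to get odds ratios.
odds_ratios <- exp(coef(mod2))

# Wald confidence limits are computed on the log-odds scale, then exponentiated.
wald_ci <- exp(confint.default(mod2))

print(round(cbind(OR = odds_ratios, wald_ci), 4))
```

The intercept row is the baseline odds for the reference cell (female, 18-24) rather than a ratio; the remaining rows are odds ratios relative to that cell.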
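If I compare the odds ratios with the raw transaction rates, that is, transactions divided by total users in each group, each expressed relative to the baseline group (female and 18-24), I get almost the same numbers:

female  (baseline)
male     94.68%

18-24   (baseline)
25-34   104.21%
35-44    91.17%
45-54    84.82%
55-64    71.97%
65+      52.66%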
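
For reference, the percentages above can be reproduced along these lines. This is a sketch that assumes "total users" means `Transaction + No.transcation` summed within each group, which is the reading that matches the figures; `rate_by` is a hypothetical helper name.

```r
# Rebuild the data from the table in the question (assumed layout of csvd).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)

# Assumption: users per cell = users with a transaction + users without one.
csvd$Total <- csvd$Transaction + csvd$No.transcation

# Marginal transaction rate for each level of one grouping variable.
rate_by <- function(group) {
  transactions <- tapply(csvd$Transaction, csvd[[group]], sum)
  users <- tapply(csvd$Total, csvd[[group]], sum)
  transactions / users
}

gender_rate <- rate_by("Gender")
age_rate <- rate_by("Age")

# Express each rate relative to the baseline level.
gender_ratio <- gender_rate / gender_rate[["female"]]
age_ratio <- age_rate / age_rate[["18-24"]]

print(round(100 * gender_ratio, 2))  # male comes out at about 94.68
print(round(100 * age_ratio, 2))     # 65+ comes out at about 52.66
```

These are ratios of proportions (relative risks), not odds ratios. With transaction rates of only 2 to 3 percent the two are numerically close, which is part of why the numbers line up so well.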
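So what is even the point of running a logistic regression here? This data set has only 2 features, but it could just as well have 50. In that case, what does LR offer over simply looking at the rates for each group? Is it that it adds little here because all the variables are nominal?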
Answer 0 (score: 2)
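You would expect the estimated odds ratios to be close to the realized proportions, as they are here. You are estimating the probability Pr(Y = 1 | X = x): the probability of a transaction given age range and gender. For categorical predictors like these, the intuitive estimator is simply the proportion of outcomes observed in the data. Logistic regression becomes more interesting when the predictors are continuous and you want to predict the outcome probability at predictor values you have not observed. In those cases, LR lets you map an unbounded linear function of the predictors to a probability, which must lie between 0 and 1.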
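
Both points can be illustrated with a short sketch. It assumes `csvd` is rebuilt from the table in the question, and, purely for illustration, replaces each age band with a made-up numeric midpoint (`AgeMid`, with 70 standing in for "65+"); those midpoints are my assumption, not part of the original data.

```r
# Rebuild the data from the table in the question (assumed layout of csvd).
csvd <- data.frame(
  Gender = rep(c("female", "male"), times = 6),
  Age = rep(c("18-24", "25-34", "35-44", "45-54", "55-64", "65+"), each = 2),
  No.transcation = c(138485, 144301, 248362, 295800, 265514, 379872,
                     295002, 421432, 382198, 456308, 352501, 465253),
  Transaction = c(4047, 3766, 7559, 8126, 7171, 9047,
                  7072, 9648, 7529, 9016, 4856, 6889)
)
csvd$Observed <- csvd$Transaction / (csvd$Transaction + csvd$No.transcation)

# 1) Categorical predictors with every interaction: one parameter per cell,
#    so the fitted probabilities are just the observed cell proportions.
saturated <- glm(cbind(Transaction, No.transcation) ~ Gender * Age,
                 data = csvd, family = binomial())
max_gap <- max(abs(fitted(saturated) - csvd$Observed))
print(max_gap)  # numerically zero

# 2) Continuous predictor: hypothetical midpoints for each age band.
age_mid <- c("18-24" = 21, "25-34" = 29.5, "35-44" = 39.5,
             "45-54" = 49.5, "55-64" = 59.5, "65+" = 70)
csvd$AgeMid <- unname(age_mid[csvd$Age])

continuous <- glm(cbind(Transaction, No.transcation) ~ Gender + AgeMid,
                  data = csvd, family = binomial())

# Predict for an age value that never appears in the data.
new_user <- data.frame(Gender = "female", AgeMid = 30)
eta <- predict(continuous, newdata = new_user, type = "link")    # unbounded
p <- predict(continuous, newdata = new_user, type = "response")  # in (0, 1)

# The response is the inverse logit of the linear predictor.
print(c(linear_predictor = unname(eta), probability = unname(p)))
print(all.equal(unname(plogis(eta)), unname(p)))
```

In the first model the regression adds nothing beyond the table of proportions; in the second it interpolates to an age of 30, for which no group proportion exists.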