I don't know why, but my model predicts everything as FALSE, which is clearly not the right way to predict the test data.
Data structure:
$ Anrede : Factor w/ 4 levels "Familie","Firma",..: 3 4 4 4 4 3 3 3 4 3 ...
$ KontaktPerTelefon : num 1 0 1 1 1 1 1 1 1 0 ...
$ KontaktPerEmail : num 1 1 1 1 1 1 1 1 1 1 ...
$ JahresbeitragBrutto: num 60 25 60 12 60 60 24 24 48 48 ...
$ EMailBoolean : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ Jahreszeit : Factor w/ 4 levels "Frühling","Herbst",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Tageszeit : Factor w/ 4 levels "Abend","Mittag",..: 1 3 3 4 3 3 3 2 1 1 ...
$ Organisation : Factor w/ 3 levels "BRK","DRK","MHD": 1 1 1 1 1 1 1 1 1 1 ...
$ Alter : num 48.1 56.1 32.3 63.8 34.5 ...
$ StornoBoolean : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
R code for the model:
library(caTools)  # provides sample.split()
set.seed(101)
sample <- sample.split(df_data_modeling$StornoBoolean, SplitRatio = 0.70)
train = subset(df_data_modeling, sample == TRUE)
test = subset(df_data_modeling, sample == FALSE)
model = glm(StornoBoolean ~ ., family = binomial(logit), data = train)
The model summary is given here. Almost every variable is significant!
Deviance Residuals:
Min 1Q Median 3Q Max
-6.5697 -0.6222 -0.5220 -0.4229 2.9912
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.7540186 0.0695698 -10.838 < 2e-16 ***
AnredeFirma -0.1354145 0.1008984 -1.342 0.17957
AnredeFrau 0.4519410 0.0517078 8.740 < 2e-16 ***
AnredeHerr 0.2772757 0.0519187 5.341 9.27e-08 ***
KontaktPerTelefon 0.1023211 0.0223885 4.570 4.87e-06 ***
KontaktPerEmail 0.1066560 0.0228986 4.658 3.20e-06 ***
JahresbeitragBrutto 0.0008593 0.0001412 6.088 1.15e-09 ***
EMailBooleanTRUE -0.2772308 0.0226086 -12.262 < 2e-16 ***
JahreszeitHerbst -0.4084937 0.0388069 -10.526 < 2e-16 ***
JahreszeitSommer -0.1130239 0.0257069 -4.397 1.10e-05 ***
JahreszeitWinter -0.0632982 0.0424629 -1.491 0.13605
TageszeitMittag 0.1101916 0.0243596 4.524 6.08e-06 ***
TageszeitNachmittag 0.0801742 0.0244504 3.279 0.00104 **
TageszeitVormittag 0.0811602 0.0318205 2.551 0.01075 *
OrganisationDRK -0.2433693 0.0230773 -10.546 < 2e-16 ***
OrganisationMHD 0.1593983 0.0262643 6.069 1.29e-09 ***
Alter -0.0231121 0.0005689 -40.627 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 80553 on 93752 degrees of freedom
Residual deviance: 78042 on 93736 degrees of freedom
AIC: 78076
Number of Fisher Scoring iterations: 4
My confusion matrix and code:
test$predicted.Storno = predict(model, newdata=test, type="response")
table(test$StornoBoolean, test$predicted.Storno > 0.5)
FALSE TRUE
FALSE 33982 8
TRUE 6188 0
I really don't know why my predictions are so bad. Can anyone help me?
Answer 0 (score: 1)
I'm not sure what exactly you are trying to predict, but it could well be that most of the variables are significant, since the sample is large (the model was fitted on roughly 94,000 rows).
But the main question: why does it predict everything (except 8 cases) as FALSE?
Answer: it doesn't, but you are testing it with test$predicted.Storno > 0.5, which is the same as asking: how many cases have more than a 50% chance of occurring?
As we can see in your table, only about 15% of the cases are TRUE, so it could well be that even the cases with the highest predicted probability remain under 50%. That sounds vague, so let me explain with an example:
Smoking increases your odds of getting lung cancer.
Working in the mines increases your odds of getting lung cancer.
A family history of cancer increases your odds of getting cancer.
What are the odds that a mineworker who smokes and has a family history of cancer will get lung cancer before he's 50?
His chances are not good, but I would guess the probability is still under 50%, maybe 10% (compared with maybe 0.2% for the general public).
So if you build a model out of this, it will output something like predicted = 0.1, which you translate to FALSE. And if you run this model on 100 smoking mineworkers with a family history, each of them will have a probability below 50% of getting cancer: 100 times FALSE. Yet statistically we expect about 10 of them to get lung cancer. It is just that, individually, each of the 100 can expect to stay healthy.
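The mineworker example can be checked with a few lines of base R. This is only a sketch: the 10% risk and the group size of 100 are the guessed numbers from the example above, and the variable names are made up for illustration.

```r
set.seed(1)                         # arbitrary seed, only for reproducibility

# Assumption: each of the 100 people has the same individual risk of 10%
p <- rep(0.1, 100)

# Simulate who actually gets ill
outcome <- rbinom(length(p), size = 1, prob = p) == 1

# Classify with the usual 0.5 cutoff: nobody exceeds it
predicted <- p > 0.5

sum(predicted)                      # number of people predicted TRUE
sum(p)                              # expected number of actual cases
table(actual = outcome, predicted = predicted)
```

Every individual prediction is FALSE, yet the sum of the probabilities says to expect about 10 cases in the group. The probabilities carry the information; the 0.5 cutoff throws it away.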
So you have to know what you are asking. There are statistical methods for choosing the cutoff value, which I don't know enough about, but first you need to know exactly what question you want answered.
EDIT:
It's not so much a question of how to edit/tweak your model, but more a question of how to interpret the result you get.
Some examples of what you might ask, and how to get answers:
table(test$StornoBoolean, test$predicted.Storno > 6188/(33982+6188+8))
library(ggplot2); print(ggplot(data=test)+geom_histogram(aes(x=predicted.Storno, fill=StornoBoolean), position='stack'))
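To see how the choice of cutoff changes the confusion matrix, one option is to tabulate sensitivity and specificity for several cutoffs. The sketch below uses simulated data, not the data from the question: the data frame d, the columns age and storno, and the coefficients of the simulated relationship are all assumptions chosen to give a base rate of roughly 15%.

```r
set.seed(42)                                   # arbitrary seed
n <- 5000
age <- runif(n, 18, 90)

# Assumed true relationship: cancellation becomes less likely with age
p_true <- plogis(-0.5 - 0.025 * age)
storno <- rbinom(n, 1, p_true) == 1
d <- data.frame(age = age, storno = storno)

fit <- glm(storno ~ age, family = binomial(link = "logit"), data = d)
d$p_hat <- predict(fit, type = "response")

# Candidate cutoffs, including the observed base rate, from high to low
cutoffs <- sort(c(0.5, 0.3, mean(d$storno), 0.1), decreasing = TRUE)

res <- t(sapply(cutoffs, function(k) {
  pred <- d$p_hat > k
  c(cutoff      = k,
    sensitivity = sum(pred & d$storno) / sum(d$storno),    # share of TRUE cases found
    specificity = sum(!pred & !d$storno) / sum(!d$storno)) # share of FALSE cases kept
}))
print(round(res, 3))
```

With these assumed parameters no fitted probability comes near 0.5, so the first row finds none of the TRUE cases. Lowering the cutoff trades specificity for sensitivity; which trade-off is acceptable depends on the question you are asking.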
Answer 1 (score: 0)
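Your data is imbalanced. You could try oversampling/undersampling techniques such as SMOTE, but the most direct thing to do is probably to lower the positive threshold from 0.5 to a smaller value.
The reason is that the data is skewed towards 0, so the output will be skewed as well, because that is the best way to minimise the loss function.
In other words, the algorithm can learn a lot from the negative class but not much from the positive class, so when it has to make a prediction it will rarely have more than 0.5 worth of evidence for the positive class. It therefore helps to say: "I don't need 0.5 of evidence for the positive class, I only need (say) 0.2." You can also think of it the other way around: since the algorithm knows much more about the negative class, an output of 0.2 is already enough evidence against the negative class, so I should predict positive.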
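As a rough illustration of the resampling route, here is a sketch of plain random undersampling in base R. It is not SMOTE, which creates synthetic minority cases instead. The data is simulated, and the names d, age, storno, fit_full and fit_bal are assumptions made for this example.

```r
set.seed(7)                                    # arbitrary seed
n <- 5000
age <- runif(n, 18, 90)
storno <- rbinom(n, 1, plogis(-0.5 - 0.025 * age)) == 1   # roughly 15% TRUE
d <- data.frame(age = age, storno = storno)

# Random undersampling: keep all positives, draw an equal number of negatives
pos <- which(d$storno)
neg <- sample(which(!d$storno), length(pos))
balanced <- d[c(pos, neg), ]

fit_full <- glm(storno ~ age, family = binomial, data = d)
fit_bal  <- glm(storno ~ age, family = binomial, data = balanced)

# Score the full data set with both models
p_full <- predict(fit_full, newdata = d, type = "response")
p_bal  <- predict(fit_bal,  newdata = d, type = "response")

mean(p_full)                                   # matches the observed base rate
mean(p_bal)                                    # shifted upwards by the rebalancing
table(actual = d$storno, predicted = p_bal > 0.5)
```

The model fitted on the balanced sample produces much higher scores, so a 0.5 cutoff starts to flag positives. Those scores are no longer calibrated probabilities for the original population, which is one reason simply lowering the threshold on the original model is often the more direct fix.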