Question

For some reason my model predicts everything as FALSE, which is obviously not the right way to predict the test data.

Data structure:

$ Anrede             : Factor w/ 4 levels "Familie","Firma",..: 3 4 4 4 4 3 3 3 4 3 ...
 $ KontaktPerTelefon  : num  1 0 1 1 1 1 1 1 1 0 ...
 $ KontaktPerEmail    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ JahresbeitragBrutto: num  60 25 60 12 60 60 24 24 48 48 ...
 $ EMailBoolean       : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
 $ Jahreszeit         : Factor w/ 4 levels "Frühling","Herbst",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Tageszeit          : Factor w/ 4 levels "Abend","Mittag",..: 1 3 3 4 3 3 3 2 1 1 ...
 $ Organisation       : Factor w/ 3 levels "BRK","DRK","MHD": 1 1 1 1 1 1 1 1 1 1 ...
 $ Alter              : num  48.1 56.1 32.3 63.8 34.5 ...
 $ StornoBoolean      : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...

R code for the model:

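library(caTools)  # provides sample.split()
set.seed(101)
# 70/30 split that preserves the TRUE/FALSE ratio of the outcome
sample <- sample.split(df_data_modeling$StornoBoolean, SplitRatio = 0.70)
train = subset(df_data_modeling, sample == TRUE)
test = subset(df_data_modeling, sample == FALSE)
# Logistic regression of StornoBoolean on all other columns
model = glm(StornoBoolean ~ ., family = binomial(logit), data = train)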

The model output is given here. Almost every variable is significant!

   Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-6.5697  -0.6222  -0.5220  -0.4229   2.9912  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -0.7540186  0.0695698 -10.838  < 2e-16 ***
AnredeFirma         -0.1354145  0.1008984  -1.342  0.17957    
AnredeFrau           0.4519410  0.0517078   8.740  < 2e-16 ***
AnredeHerr           0.2772757  0.0519187   5.341 9.27e-08 ***
KontaktPerTelefon    0.1023211  0.0223885   4.570 4.87e-06 ***
KontaktPerEmail      0.1066560  0.0228986   4.658 3.20e-06 ***
JahresbeitragBrutto  0.0008593  0.0001412   6.088 1.15e-09 ***
EMailBooleanTRUE    -0.2772308  0.0226086 -12.262  < 2e-16 ***
JahreszeitHerbst    -0.4084937  0.0388069 -10.526  < 2e-16 ***
JahreszeitSommer    -0.1130239  0.0257069  -4.397 1.10e-05 ***
JahreszeitWinter    -0.0632982  0.0424629  -1.491  0.13605    
TageszeitMittag      0.1101916  0.0243596   4.524 6.08e-06 ***
TageszeitNachmittag  0.0801742  0.0244504   3.279  0.00104 ** 
TageszeitVormittag   0.0811602  0.0318205   2.551  0.01075 *  
OrganisationDRK     -0.2433693  0.0230773 -10.546  < 2e-16 ***
OrganisationMHD      0.1593983  0.0262643   6.069 1.29e-09 ***
Alter               -0.0231121  0.0005689 -40.627  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 80553  on 93752  degrees of freedom
Residual deviance: 78042  on 93736  degrees of freedom
AIC: 78076

Number of Fisher Scoring iterations: 4

My confusion matrix and code:

test$predicted.Storno = predict(model, newdata=test, type="response")
table(test$StornoBoolean, test$predicted.Storno > 0.5)




        FALSE  TRUE
  FALSE 33982     8
  TRUE   6188     0

I really don't know why my predictions are so bad. Can anyone help me?

Answer 1

I'm not sure what exactly you are trying to predict, but it is quite possible for most of the variables to be significant, since the sample is large (about 94,000 training rows and about 40,000 test rows).

But the main question: why does it predict everything (except 8 cases) as FALSE?
Answer: it doesn't, but you're testing it with test$predicted.Storno > 0.5, which is the same as asking: how many cases have a more than 50% chance of occurring? As we can see in your table, only about 15% of the cases are TRUE, so it could well be that even the cases with the highest odds remain under 50%. That sounds vague, so let me explain with an example:

Smoking increases your odds of getting lung cancer.
Working in the mines increases your odds of getting lung cancer.
A family history of cancer increases your odds of getting cancer.
What are the odds that a mineworker who smokes and has a family history of cancer will get lung cancer before he's 50?
His odds are not good, but I would guess this chance is still under 50%, maybe 10% (in contrast with maybe 0.2% for the general public).
So if you make a model out of this, the model will say something like predicted=0.1, which you translate to FALSE. And if you run this model on 100 smoking mineworkers with a family history, each of them will have a chance below 50% of getting cancer: 100 times FALSE. Even though we know that, statistically, probably 10 of them will get lung cancer. It's just that each of the 100 can individually expect to stay healthy.
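
The same point can be checked against the coefficients printed in the question. The sketch below plugs one hypothetical member into the fitted equation by hand. The profile (Frau, contacted by phone and e-mail, 60 annual fee, e-mail on file, Winter, age 48.1, with Tageszeit and Organisation at the levels that have no row in the summary and are therefore assumed to be the reference levels) is made up for illustration and is not a specific row of the data.

```r
# Coefficients copied from the summary in the question
coefs <- c(intercept = -0.7540186, AnredeFrau = 0.4519410,
           KontaktPerTelefon = 0.1023211, KontaktPerEmail = 0.1066560,
           JahresbeitragBrutto = 0.0008593, EMailBooleanTRUE = -0.2772308,
           JahreszeitWinter = -0.0632982, Alter = -0.0231121)

# Hypothetical member (assumption for illustration only)
profile <- c(intercept = 1, AnredeFrau = 1,
             KontaktPerTelefon = 1, KontaktPerEmail = 1,
             JahresbeitragBrutto = 60, EMailBooleanTRUE = 1,
             JahreszeitWinter = 1, Alter = 48.1)

eta <- sum(coefs * profile[names(coefs)])  # linear predictor (log-odds)
p <- plogis(eta)                           # inverse logit: predicted probability
p
```

For this profile the predicted probability comes out at roughly 0.18, well below the 0.5 cut-off, so the member is labelled FALSE even though the model assigns an above-average cancellation risk.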

So for your question, you have to know what you are asking for. There are more formal statistical analyses for choosing exactly which cut-off value to use, which I don't know enough about, but first you need to know exactly what you are asking.
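
The difference between asking for a label per individual and asking how many cases to expect can be seen on simulated data. This is a minimal sketch, not the asker's data: the sample size, the single predictor x and the coefficients -1.8 and 0.5 are arbitrary assumptions chosen to give an outcome that is TRUE about 15% of the time.

```r
# Simulated rare outcome with one weak predictor (all numbers are assumptions)
set.seed(1)
n <- 10000
x <- rnorm(n)
p_true <- plogis(-1.8 + 0.5 * x)       # true probabilities, mostly far below 0.5
y <- rbinom(n, size = 1, prob = p_true)

fit <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

mean(y)            # observed share of positive cases
mean(p_hat)        # average predicted probability; matches mean(y) for a logistic fit with intercept
mean(p_hat > 0.5)  # share of cases labelled positive at a 0.5 cut-off (close to zero)
sum(p_hat)         # number of positive cases the model expects in total
```

The model can be sensible about the total number of positives while labelling almost nobody as positive at 0.5, which is the pattern in the question's confusion matrix.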

EDIT:
It's not so much a question of how to edit/tweak your model, but more a question of how to interpret the result you get. Some examples of what you might ask, and how to get answers:

Which members are more likely than average to be TRUE? You can test that by checking which predicted values are above the average rate, like this: table(test$StornoBoolean, test$predicted.Storno > 6188/(33982+6188+8))
Which members are most likely to be TRUE? test <- test[order(test$predicted.Storno, decreasing=TRUE),] will order your test results by predicted value.
Checking if your model is (generally) reliable: you can plot the predicted probabilities against the actual ratio.
library(ggplot2); print(ggplot(data=test)+geom_histogram(aes(x=predicted.Storno, fill=StornoBoolean), position='stack'))
If your model were perfect, 10% of the full bar at x=0.10 would be TRUE, 20% at x=0.20, and so on. It generally won't be, but you should be able to see the TRUE fraction increase as x increases. If you want to see the fraction more clearly, you can use position='fill' in the call, which stretches or shrinks all bars to the same height. However, this may give a misleading picture for predicted values that rarely occur, so you should only look at x-values that are reasonably frequent.
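
The reliability check described above can also be done as a table instead of a plot. The following is a rough sketch in base R; calibration_table is a hypothetical helper name, and the demo data are simulated so that the predictions are well calibrated by construction.

```r
# Compare the mean predicted probability with the observed TRUE rate per bin.
# calibration_table is a hypothetical helper, not part of any package.
calibration_table <- function(predicted, actual, breaks = seq(0, 1, by = 0.1)) {
  bin <- cut(predicted, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin            = levels(bin),
    n              = as.vector(table(bin)),                  # cases per bin
    mean_predicted = as.vector(tapply(predicted, bin, mean)),
    observed_rate  = as.vector(tapply(actual, bin, mean))    # NA for empty bins
  )
}

# Simulated demo: outcomes are drawn with exactly the predicted probability
set.seed(2)
predicted <- runif(5000, min = 0, max = 0.6)
actual <- rbinom(5000, size = 1, prob = predicted) == 1
calibration_table(predicted, actual)
```

With the data from the question, the call would be calibration_table(test$predicted.Storno, test$StornoBoolean). Bins with a small n are the rarely occurring predicted values mentioned above and should be read with caution.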

Answer 2

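Your data is imbalanced. You could try over-/undersampling techniques such as SMOTE, but the most straightforward thing you can do is probably to change the positive threshold from 0.5 to a smaller value.

The reason is that the data are skewed toward 0, so the outputs will be skewed as well, since that is the best way to optimize the loss function.

In other words, the algorithm can learn a lot from the negative class but not much from the positive class, so when it has to make a prediction it will rarely have more than 0.5 of evidence for the positive class. It therefore helps to say: "I don't need 0.5 of evidence for the positive class, I only need (say) 0.2." You can also think of it the other way around: since the algorithm knows much more about the negative class, an output of 0.2 is enough evidence that a case is not negative, so I should predict positive.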
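
The effect of lowering the threshold can be inspected before committing to a value. This is a minimal sketch on simulated data; threshold_metrics is a hypothetical helper, and the candidate thresholds and simulation settings are arbitrary assumptions.

```r
# Sensitivity, specificity and precision at one cut-off.
# threshold_metrics is a hypothetical helper, not part of any package.
threshold_metrics <- function(predicted, actual, threshold) {
  pred_pos <- predicted > threshold
  tp <- sum(pred_pos & actual)
  fp <- sum(pred_pos & !actual)
  fn <- sum(!pred_pos & actual)
  tn <- sum(!pred_pos & !actual)
  c(threshold   = threshold,
    sensitivity = tp / (tp + fn),   # share of true positives that are caught
    specificity = tn / (tn + fp),   # share of true negatives left alone
    precision   = if (tp + fp > 0) tp / (tp + fp) else NA_real_)
}

# Simulated imbalanced data (assumed settings)
set.seed(3)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1.8 + 0.8 * x))
actual <- y == 1
predicted <- predict(glm(y ~ x, family = binomial), type = "response")

thresholds <- c(0.1, 0.15, 0.2, 0.3, 0.5)
results <- t(sapply(thresholds, function(th) threshold_metrics(predicted, actual, th)))
round(results, 3)
```

Lower thresholds catch more of the positive class at the price of more false alarms, so the choice depends on how costly each kind of error is. For the question's data the inputs would be test$predicted.Storno and test$StornoBoolean.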

R - Logistic regression - Model predictions on the split data are very poor. Any ideas?