(Machine learning) Higher polynomial degree + data balancing = disaster?

Time: 2015-07-13 11:50:25

Tags: python numpy machine-learning weka logistic-regression

Dataset: https://www.kaggle.com/c/GiveMeSomeCredit/data (cs-training.csv)

Training tool: Weka

Data processing tool: Python (used to generate the higher-degree polynomial features)

Algorithm: Logistic regression
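The question does not show how the higher-degree features were generated in Python. A minimal sketch of one common approach follows, assuming that "polynomial degree 2" means appending all squares and pairwise products to the original columns. The function name `polynomial_features` and the sample row are hypothetical, and the sketch uses only the standard library so it runs without extra dependencies.

```python
from itertools import combinations_with_replacement


def polynomial_features(row, degree=2):
    """Expand one numeric row into all monomials of total degree 1..degree."""
    expanded = []
    for d in range(1, degree + 1):
        # Each index tuple names one monomial, e.g. (0, 1) -> row[0] * row[1].
        for idx in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for i in idx:
                term *= row[i]
            expanded.append(term)
    return expanded


# Hypothetical two-feature row, not taken from cs-training.csv.
row = [2.0, 3.0]
print(polynomial_features(row, degree=1))  # [2.0, 3.0]
print(polynomial_features(row, degree=2))  # [2.0, 3.0, 4.0, 6.0, 9.0]
```

Under this assumption, a row with 10 input columns grows to 65 columns at degree 2 (10 linear terms plus 55 squares and products), so the degree-2 model has far more coefficients to fit than the degree-1 model.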

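Problem: Either balancing the data (by downsampling the majority class to a 50:50 ratio) or introducing a higher polynomial degree does improve the F-measure, at a slight cost in accuracy.

However, balancing the data and introducing a higher polynomial degree together ruin the accuracy.

Looking at the results below, is this the general case, or is there a problem with my experimental design?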
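The balancing step is described only as downsampling the majority class to 50:50. A minimal sketch of that idea follows, assuming random sampling without replacement. The function name `downsample_majority`, the fixed seed, and the toy data are hypothetical and are not the code used in the question.

```python
import random


def downsample_majority(rows, labels, seed=0):
    """Randomly drop rows of the larger class until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    minority_size = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Sampling without replacement; the minority class is kept whole.
        for row in rng.sample(group, minority_size):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): 2 positives ("y") and 8 negatives ("n").
rows = [[i] for i in range(10)]
labels = ["y", "y"] + ["n"] * 8
balanced = downsample_majority(rows, labels)
print(len(balanced))  # 4
```

In this sketch only the training rows are resampled. A model trained on a 50:50 sample and then scored on data with the original class ratio sees far more negatives at test time than it did during training.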

1. Original Imbalanced Data (90,000) + Polynomial Degree 1
Accuracy: 93%  F-Measure: 0.074  ROC Area: 0.698
     y       n  <-- classified as
    79   1,927 | y (actual)
    60  27,934 | n (actual)

2. Original Imbalanced Data (90,000) + Polynomial Degree 2
Accuracy: 93%  F-Measure: 0.329  ROC Area: 0.806
     y       n  <-- classified as
   480   1,526 | y (actual)
   431  27,563 | n (actual)

3. Balanced Data (11,942) + Polynomial Degree 1
Accuracy: 67%  F-Measure: 0.207  ROC Area: 0.73
     y       n  <-- classified as
 1,292     714 | y (actual)
 9,157  18,837 | n (actual)

4. Balanced Data (11,942) + Polynomial Degree 2
Accuracy: 6%  F-Measure: 0.123  ROC Area: 0.7
     y       n  <-- classified as
 1,965      41 | y (actual)
27,954      40 | n (actual)
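The four confusion matrices can be cross-checked against the reported figures with a small sketch. It assumes that "y" is the positive class and that the reported F-measure is the F1 score for that class. The function name `metrics` and the dictionary keys are hypothetical, while the cell counts are copied from the matrices above.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F-measure for the positive ("y") class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# (tp, fn, fp, tn), read row by row from the four confusion matrices.
experiments = {
    "1. imbalanced, degree 1": (79, 1927, 60, 27934),
    "2. imbalanced, degree 2": (480, 1526, 431, 27563),
    "3. balanced, degree 1": (1292, 714, 9157, 18837),
    "4. balanced, degree 2": (1965, 41, 27954, 40),
}

for name, cells in experiments.items():
    acc, prec, rec, f1 = metrics(*cells)
    print(f"{name}: acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} F={f1:.3f}")
```

Under these assumptions the recomputed F-measures agree with the reported 0.074, 0.329, 0.207 and 0.123. Every matrix sums to 30,000 instances with 2,006 actual positives, so all four models appear to be scored on the same imbalanced evaluation set. Recall for the "y" class rises from about 0.04 in experiment 1 to about 0.98 in experiment 4, where only 40 of the 27,994 actual "n" instances are classified as "n". That near-total shift toward predicting "y" is what pulls accuracy down to roughly 6%.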

0 answers