Dataset: https://www.kaggle.com/c/GiveMeSomeCredit/data (cs-training.csv)
Training tool: Weka
Data processing tool: Python (used to generate the higher-degree polynomial features)
Algorithm: Logistic regression
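Problem: Balancing the data (by downsampling the majority class to a 50:50 ratio) OR introducing a higher polynomial degree does improve the F-measure, at some cost in accuracy.
However, balancing the data AND introducing a higher polynomial degree together badly destroys accuracy.
Looking at the results below, is this what generally happens, or is there something wrong with my experimental design?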
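For reference, the two preprocessing steps mentioned above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the exact script behind the results: the helper names `downsample_majority` and `degree2_features` are hypothetical, the toy rows stand in for cs-training.csv, and the degree-2 expansion is assumed to mean "original features plus all pairwise products".

```python
import random
from itertools import combinations_with_replacement


def downsample_majority(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row and an equally sized random sample of the majority.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def degree2_features(row):
    """Return the original features followed by all pairwise products (squares included)."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products


# Toy data (an assumption, not the Kaggle file): 3 features, 1 positive in 10.
rows = [[float(i), float(i % 7), float(i % 3)] for i in range(100)]
labels = [1 if i % 10 == 0 else 0 for i in range(100)]

balanced_rows, balanced_labels = downsample_majority(rows, labels)
expanded = [degree2_features(r) for r in balanced_rows]
print(len(balanced_rows), sum(balanced_labels), len(expanded[0]))
```

With 3 input features the degree-2 expansion yields 9 columns (3 original plus 6 products), so the feature count grows quickly while downsampling shrinks the number of training rows.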
1. Original Imbalanced Data (90,000) + Polynomial degree of (1)
Accuracy: 93% F-Measure: 0.074 ROC Area: 0.698
y n <-- classified as
79 1,927| y (actual)
60 27,934| n (actual)
2. Original Imbalanced Data (90,000) + Polynomial degree of (2)
Accuracy: 93% F-Measure: 0.329 ROC Area: 0.806
y n <-- classified as
480 1,526| y (actual)
431 27,563| n (actual)
3. Balanced Data (11,942) + Polynomial degree of (1)
Accuracy: 67% F-Measure: 0.207 ROC Area: 0.73
y n <-- classified as
1,292 714| y (actual)
9,157 18,837| n (actual)
4. Balanced Data (11,942) + Polynomial degree of (2)
Accuracy: 6% F-Measure: 0.123 ROC Area: 0.7
y n <-- classified as
1,965 41| y (actual)
27,954 40| n (actual)
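
As a sanity check on the numbers above, accuracy, precision, recall and F-measure can be recomputed directly from the four confusion matrices. This is a small sketch that assumes "y" is the positive class and that the reported F-measure is the one for class "y"; the `summarize` helper and the run labels are names chosen here for illustration.

```python
# (TP, FN, FP, TN) read off each confusion matrix above, with "y" as the positive class.
runs = {
    "imbalanced, degree 1": (79, 1927, 60, 27934),
    "imbalanced, degree 2": (480, 1526, 431, 27563),
    "balanced, degree 1": (1292, 714, 9157, 18837),
    "balanced, degree 2": (1965, 41, 27954, 40),
}


def summarize(tp, fn, fp, tn):
    """Compute standard binary-classification metrics from confusion-matrix cells."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


results = {name: summarize(*cells) for name, cells in runs.items()}
for name, r in results.items():
    print(f"{name}: n={r['total']} acc={r['accuracy']:.3f} "
          f"prec={r['precision']:.3f} rec={r['recall']:.3f} F={r['f_measure']:.3f}")
```

The recomputed F-measures (0.074, 0.329, 0.207, 0.123) match the reported ones. Every matrix sums to 30,000 instances with 2,006 actual "y" cases, so all four models are evaluated on the same imbalanced set, where predicting "n" for everything would already give about 93% accuracy. Recall for "y" rises from about 0.04 to about 0.98 across the four runs while precision falls, which is what drives the accuracy down.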