I have 4 continuous variables x_1 through x_4, each min-max scaled from the raw data so that it lies in the range [0, 1]. I am using LogisticRegressionCV() to predict the class label as "1" or "0".
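For context, a minimal sketch of that scaling step, assuming scikit-learn's MinMaxScaler; the raw feature arrays are not shown in the question, so `raw` below is a hypothetical stand-in:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw feature matrix standing in for the original data.
raw = np.random.default_rng(0).normal(size=(1000, 4))

# Min-max scaling maps each column to [0, 1]:
#   x_scaled = (x - x.min()) / (x.max() - x.min())
scaled = MinMaxScaler().fit_transform(raw)
x_1, x_2, x_3, x_4 = scaled.T  # one 1-D array per feature, as in the question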
What isn't working? Well, my LogisticRegressionCV() predicts class "1" for every sample, while one can clearly see that this is not the case.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Stack the four min-max scaled features into an (n_samples, 4) matrix.
num_min_max = np.column_stack((x_1, x_2, x_3, x_4))

# Stratified 80/20 split so both classes keep their proportions.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_indices, test_indices in split.split(num_min_max, y):
    x_train = num_min_max[train_indices]
    y_train = y[train_indices]
    x_test = num_min_max[test_indices]
    y_test = y[test_indices]

reg = LogisticRegressionCV(Cs=[0.01],
                           fit_intercept=True,
                           cv=5,
                           dual=False,
                           penalty='l2',
                           scoring=None,
                           solver='lbfgs',
                           tol=0.0001,
                           max_iter=100,
                           class_weight=None,
                           n_jobs=-1,
                           verbose=0,
                           refit=True,
                           intercept_scaling=1.0,
                           multi_class='auto',
                           random_state=0,
                           l1_ratios=None)
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

print(" ")
print(" set(y_test) ", set(y_test))
print(" set(y_pred) ", set(y_pred))
print(" ")

accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred) * 100
recall = recall_score(y_test, y_pred) * 100
f1score = f1_score(y_test, y_pred) * 100
print(accuracy, precision, recall, f1score)
print(classification_report(y_test, y_pred))
The problem shows up in the output:
 set(y_test)  {1, 0}
 set(y_pred)  {1}

89.7196261682243 89.7196261682243 100.0 94.58128078817734

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        55
           1       0.90      1.00      0.95       480

    accuracy                           0.90       535
   macro avg       0.45      0.50      0.47       535
weighted avg       0.80      0.90      0.85       535

~/.local/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1268: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
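(Aside: the UndefinedMetricWarning itself can be handled explicitly via the `zero_division` parameter the warning mentions, available in scikit-learn 0.22+; this only makes the zero handling explicit and does not fix the underlying all-"1" predictions. A minimal sketch:

from sklearn.metrics import classification_report, precision_score

# zero_division=0 keeps the current behavior (0.0 for labels with no
# predicted samples) but suppresses the warning; zero_division=1 would
# report 1.0 for those labels instead.
print(precision_score(y_test, y_pred, zero_division=0))
print(classification_report(y_test, y_pred, zero_division=0))
)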
SMOTE + same settings for LogisticRegressionCV
              precision    recall  f1-score   support

           0       0.63      0.73      0.67       381
           1       0.68      0.57      0.62       385

    accuracy                           0.65       766
   macro avg       0.65      0.65      0.65       766
weighted avg       0.65      0.65      0.65       766
The code for SMOTE with LogisticRegressionCV, via here:
import pandas as pd
import statsmodels.api as sm
from imblearn.over_sampling import SMOTE
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

num = np.column_stack((x_1, x_2, x_3, x_4))  # same stacked feature matrix as above

# Oversample the minority class with SMOTE on the training portion.
oversampler = SMOTE(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(num, y, test_size=0.2, random_state=0)
os_data_x, os_data_y = oversampler.fit_resample(x_train, y_train)  # fit_sample() in older imbalanced-learn
os_data_X = pd.DataFrame(data=os_data_x, columns=['x1', 'x2', 'x3', 'x4'])
os_data_Y = pd.DataFrame(data=os_data_y, columns=['y'])

# Check p-values with a statsmodels logit fit on the oversampled data.
logit_model = sm.Logit(os_data_Y.values.ravel(), os_data_X)
result = logit_model.fit()
print(result.summary2())

os_data_X.drop(["x2", "x4"], axis=1, inplace=True)  # based on p-values

# Re-split the oversampled data and evaluate a clone of the model above.
x_train, x_test, y_train, y_test = train_test_split(os_data_X, os_data_Y.values.ravel(), test_size=0.2, random_state=0)
classifier = clone(reg)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print('Accuracy of classifier on test set: {:.2f}'.format(classifier.score(x_test, y_test)))
cm = confusion_matrix(y_test, y_pred)  # renamed so it does not shadow the imported function
print(cm)
print(classification_report(y_test, y_pred))
Accuracy of classifier on test set: 0.71
[[ 35  15]
 [209 276]]
              precision    recall  f1-score   support

           0       0.14      0.70      0.24        50
           1       0.95      0.57      0.71       485

    accuracy                           0.58       535
   macro avg       0.55      0.63      0.47       535
weighted avg       0.87      0.58      0.67       535
Answer 0 (score: 1)
Your data appears to be imbalanced: from the precision/recall table we can see that class 1 accounts for nearly 90% of the total data. There are multiple ways to handle class imbalance; you can refer to this blog for detailed solutions.

One quick fix is to add class weights to the model (so far it is left at the default None in your code), which basically means you penalize the model more heavily when it makes a mistake on class 0 than when it makes one on class 1. To start with, you can change the class_weight value from None to 'balanced' and see how it performs.
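A minimal sketch of that change, repeating only the parameters from your code that differ from their defaults:

from sklearn.linear_model import LogisticRegressionCV

# 'balanced' weights each class inversely to its frequency:
#   weight = n_samples / (n_classes * np.bincount(y))
reg_balanced = LogisticRegressionCV(Cs=[0.01],
                                    cv=5,
                                    penalty='l2',
                                    solver='lbfgs',
                                    class_weight='balanced',
                                    random_state=0)
reg_balanced.fit(x_train, y_train)
print(classification_report(y_test, reg_balanced.predict(x_test)))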
At the same time, however, you should be aware that increasing the class weights can also hurt the performance on class 1; this is basically the trade-off you need to make.

Hope this helps!