I have 4 continuous variables x_1 through x_4, each min-max scaled from the raw data so that it lies in the range [0, 1]. I am using LogisticRegressionCV() to predict the class label as "1" or "0".
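For context, a minimal sketch of that scaling step, assuming scikit-learn's MinMaxScaler; the raw feature arrays are not shown in the question, so `raw` below is a hypothetical stand-in:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw feature matrix standing in for the original data.
raw = np.random.default_rng(0).normal(size=(1000, 4))

# Min-max scaling maps each column to [0, 1]:
#   x_scaled = (x - x.min()) / (x.max() - x.min())
scaled = MinMaxScaler().fit_transform(raw)
x_1, x_2, x_3, x_4 = scaled.T  # one 1-D array per feature, as in the question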
What isn't working? Well, my LogisticRegressionCV() predicts class "1" for every sample, while one can clearly see that this is not the case.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# Stack the four min-max scaled features into an (n_samples, 4) matrix.
num_min_max = np.column_stack((x_1, x_2, x_3, x_4))

# Stratified 80/20 split so both classes keep their proportions.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_indices, test_indices in split.split(num_min_max, y):
    x_train = num_min_max[train_indices]
    y_train = y[train_indices]
    x_test = num_min_max[test_indices]
    y_test = y[test_indices]

reg = LogisticRegressionCV(Cs=[0.01],
                           fit_intercept=True,
                           cv=5,
                           dual=False,
                           penalty='l2',
                           scoring=None,
                           solver='lbfgs',
                           tol=0.0001,
                           max_iter=100,
                           class_weight=None,
                           n_jobs=-1,
                           verbose=0,
                           refit=True,
                           intercept_scaling=1.0,
                           multi_class='auto',
                           random_state=0,
                           l1_ratios=None)
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

print(" ")
print(" set(y_test) ", set(y_test))
print(" set(y_pred) ", set(y_pred))
print(" ")

accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred) * 100
recall = recall_score(y_test, y_pred) * 100
f1score = f1_score(y_test, y_pred) * 100
print(accuracy, precision, recall, f1score)
print(classification_report(y_test, y_pred))
The problem shows up in the output:
 set(y_test)  {1, 0}
 set(y_pred)  {1}

89.7196261682243 89.7196261682243 100.0 94.58128078817734

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        55
           1       0.90      1.00      0.95       480

    accuracy                           0.90       535
   macro avg       0.45      0.50      0.47       535
weighted avg       0.80      0.90      0.85       535

~/.local/lib/python3.6/site-packages/sklearn/metrics/_classification.py:1268: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
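(Aside: the UndefinedMetricWarning itself can be handled explicitly via the `zero_division` parameter the warning mentions, available in scikit-learn 0.22+; this only makes the zero handling explicit and does not fix the underlying all-"1" predictions. A minimal sketch:

from sklearn.metrics import classification_report, precision_score

# zero_division=0 keeps the current behavior (0.0 for labels with no
# predicted samples) but suppresses the warning; zero_division=1 would
# report 1.0 for those labels instead.
print(precision_score(y_test, y_pred, zero_division=0))
print(classification_report(y_test, y_pred, zero_division=0))
)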
SMOTE + same settings for LogisticRegressionCV
              precision    recall  f1-score   support

           0       0.63      0.73      0.67       381
           1       0.68      0.57      0.62       385

    accuracy                           0.65       766
   macro avg       0.65      0.65      0.65       766
weighted avg       0.65      0.65      0.65       766
The code for SMOTE with LogisticRegressionCV, via here:
import pandas as pd
import statsmodels.api as sm
from imblearn.over_sampling import SMOTE
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

num = np.column_stack((x_1, x_2, x_3, x_4))  # same stacked feature matrix as above

# Oversample the minority class with SMOTE on the training portion.
oversampler = SMOTE(random_state=0)
x_train, x_test, y_train, y_test = train_test_split(num, y, test_size=0.2, random_state=0)
os_data_x, os_data_y = oversampler.fit_resample(x_train, y_train)  # fit_sample() in older imbalanced-learn
os_data_X = pd.DataFrame(data=os_data_x, columns=['x1', 'x2', 'x3', 'x4'])
os_data_Y = pd.DataFrame(data=os_data_y, columns=['y'])

# Check p-values with a statsmodels logit fit on the oversampled data.
logit_model = sm.Logit(os_data_Y.values.ravel(), os_data_X)
result = logit_model.fit()
print(result.summary2())

os_data_X.drop(["x2", "x4"], axis=1, inplace=True)  # based on p-values

# Re-split the oversampled data and evaluate a clone of the model above.
x_train, x_test, y_train, y_test = train_test_split(os_data_X, os_data_Y.values.ravel(), test_size=0.2, random_state=0)
classifier = clone(reg)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

print('Accuracy of classifier on test set: {:.2f}'.format(classifier.score(x_test, y_test)))
cm = confusion_matrix(y_test, y_pred)  # renamed so it does not shadow the imported function
print(cm)
print(classification_report(y_test, y_pred))
Accuracy of classifier on test set: 0.71
[[ 35  15]
 [209 276]]
              precision    recall  f1-score   support

           0       0.14      0.70      0.24        50
           1       0.95      0.57      0.71       485

    accuracy                           0.58       535
   macro avg       0.55      0.63      0.47       535
weighted avg       0.87      0.58      0.67       535
Answer 0 (score: 1)
Your data appears to be imbalanced: from the precision/recall table we can see that class 1 accounts for nearly 90% of the total data. There are multiple ways to handle class imbalance; you can refer to this blog for detailed solutions.

One quick fix is to add class weights to the model (so far it is left at the default None in your code), which basically means you penalize the model more heavily when it makes a mistake on class 0 than when it makes one on class 1. To start with, you can change the class_weight value from None to 'balanced' and see how it performs.
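A minimal sketch of that change, repeating only the parameters from your code that differ from their defaults:

from sklearn.linear_model import LogisticRegressionCV

# 'balanced' weights each class inversely to its frequency:
#   weight = n_samples / (n_classes * np.bincount(y))
reg_balanced = LogisticRegressionCV(Cs=[0.01],
                                    cv=5,
                                    penalty='l2',
                                    solver='lbfgs',
                                    class_weight='balanced',
                                    random_state=0)
reg_balanced.fit(x_train, y_train)
print(classification_report(y_test, reg_balanced.predict(x_test)))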
At the same time, however, you should be aware that increasing the class weights can also hurt the performance on class 1; this is basically the trade-off you need to make.

Hope this helps!