I am running logistic regression on a very small and simple dataset that is well separated. But I realized that the model cannot find the optimal decision boundary. Where is my mistake?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
sm_df = pd.DataFrame()
sm_df['x'] = [0.5,4.0,1.0,2.5,2.0,3.5,1.0,3.0, 1.0, 2.0]
sm_df['y'] = [1.0,3.5,1.0,3.5,1.0, 4.5, 2.0,3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]
log = linear_model.LogisticRegression()
log.fit(sm_df[['x','y']], sm_df['Bad_data'])
test_score = log.score(sm_df[['x','y']], sm_df['Bad_data'])
print("test score: ", test_score)
# Create scatterplot of dataframe
sns.lmplot(x='x',              # Horizontal axis
           y='y',              # Vertical axis
           data=sm_df,         # Data source
           fit_reg=False,      # Don't fit a regression line
           hue="Bad_data",     # Set color
           scatter_kws={"marker": "D",  # Set marker style
                        "s": 100})      # Set marker size
plt.xlabel('x')
plt.ylabel('y')
# to plot the decision boundary: w0 + w1*x + w2*y = 0  =>  y = -(w0 + w1*x) / w2
w0 = log.intercept_[0]
w1, w2 = log.coef_[0]
X = np.array([0, 4])
x2 = np.array([-w0/w2, -w0/w2 - w1*4/w2])
plt.plot(X, x2)
t_x = [1.5]
t_y = [1.8]
pr = log.predict([[1.5, 1.8]])  # predict expects a 2D array of samples
plt.scatter(t_x,             # Horizontal axis
            t_y, c='r')      # Red marker
plt.annotate(pr, (1.5, 1.9))
Answer 0 (score: 1)
The reason is that the classification error is not the only thing the model is penalized for: there is also a regularization term. If you make the regularization term smaller, for example with

log = linear_model.LogisticRegression(C=10.)

then all points in this example will be classified correctly. That is because the model then cares more about classifying the points correctly and relatively less about regularization. Here the parameter C is the inverse of the regularization strength, and it defaults to 1.
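A minimal sketch of that fix on the same data, changing only `C` (assuming scikit-learn's default solver):

```python
import pandas as pd
from sklearn import linear_model

sm_df = pd.DataFrame()
sm_df['x'] = [0.5, 4.0, 1.0, 2.5, 2.0, 3.5, 1.0, 3.0, 1.0, 2.0]
sm_df['y'] = [1.0, 3.5, 1.0, 3.5, 1.0, 4.5, 2.0, 3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]

# Weaker regularization: C is the inverse of the regularization strength
log = linear_model.LogisticRegression(C=10.)
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("test score: ", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))
```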
Part of the reason this is necessary here is that your data is not standardized. If you standardize the data before applying logistic regression (giving x and y zero mean and unit variance), then you can also get a perfect fit with C=1. You can do that with
sm_df['x'] = (sm_df['x'] - sm_df['x'].mean()) / sm_df['x'].std()
sm_df['y'] = (sm_df['y'] - sm_df['y'].mean()) / sm_df['y'].std()
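Putting that standardization together with the default C=1, as a self-contained sketch (the column-wise z-scoring uses pandas' sample standard deviation, `std()` with `ddof=1`):

```python
import pandas as pd
from sklearn import linear_model

sm_df = pd.DataFrame()
sm_df['x'] = [0.5, 4.0, 1.0, 2.5, 2.0, 3.5, 1.0, 3.0, 1.0, 2.0]
sm_df['y'] = [1.0, 3.5, 1.0, 3.5, 1.0, 4.5, 2.0, 3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]

# Standardize each feature: zero mean, unit variance
sm_df['x'] = (sm_df['x'] - sm_df['x'].mean()) / sm_df['x'].std()
sm_df['y'] = (sm_df['y'] - sm_df['y'].mean()) / sm_df['y'].std()

log = linear_model.LogisticRegression()  # default C=1
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("score after standardizing:", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))
```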