I am running logistic regression on a very small and simple dataset that is well separated. But I realized that the model cannot find the optimal decision boundary. Where is my mistake?
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
sm_df = pd.DataFrame()
sm_df['x'] = [0.5,4.0,1.0,2.5,2.0,3.5,1.0,3.0, 1.0, 2.0]
sm_df['y'] = [1.0,3.5,1.0,3.5,1.0, 4.5, 2.0,3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]
log = linear_model.LogisticRegression()
log.fit(sm_df[['x','y']], sm_df['Bad_data'])
test_score = log.score(sm_df[['x','y']], sm_df['Bad_data'])
print("test score: ", test_score)
# Create scatterplot of dataframe
sns.lmplot(x='x',              # Horizontal axis
           y='y',              # Vertical axis
           data=sm_df,         # Data source
           fit_reg=False,      # Don't fit a regression line
           hue="Bad_data",     # Set color
           scatter_kws={"marker": "D",  # Set marker style
                        "s": 100})      # Set marker size
plt.xlabel('x')
plt.ylabel('y')
# to plot the decision boundary: w0 + w1*x + w2*y = 0  =>  y = -(w0 + w1*x) / w2
w0 = log.intercept_[0]
w1, w2 = log.coef_[0]
X = np.array([0, 4])
x2 = np.array([-w0/w2, -w0/w2 - w1*4/w2])
plt.plot(X, x2)
t_x = [1.5]
t_y = [1.8]
pr = log.predict([[1.5, 1.8]])  # predict expects a 2D array of samples
plt.scatter(t_x,             # Horizontal axis
            t_y, c='r')      # Red marker
plt.annotate(pr, (1.5, 1.9))
Answer 0 (score: 1)
The reason is that the classification error is not the only thing the model is penalized for: there is also a regularization term. If you make the regularization term smaller, for example with

log = linear_model.LogisticRegression(C=10.)

then all points in this example will be classified correctly. That is because the model then cares more about classifying the points correctly and relatively less about regularization. Here the parameter C is the inverse of the regularization strength, and it defaults to 1.
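A minimal sketch of that fix on the same data, changing only `C` (assuming scikit-learn's default solver):

```python
import pandas as pd
from sklearn import linear_model

sm_df = pd.DataFrame()
sm_df['x'] = [0.5, 4.0, 1.0, 2.5, 2.0, 3.5, 1.0, 3.0, 1.0, 2.0]
sm_df['y'] = [1.0, 3.5, 1.0, 3.5, 1.0, 4.5, 2.0, 3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]

# Weaker regularization: C is the inverse of the regularization strength
log = linear_model.LogisticRegression(C=10.)
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("test score: ", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))
```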
Part of the reason this is necessary here is that your data is not standardized. If you standardize the data before applying logistic regression (giving x and y zero mean and unit variance), then you can also get a perfect fit with C=1. You can do that with
sm_df['x'] = (sm_df['x'] - sm_df['x'].mean()) / sm_df['x'].std()
sm_df['y'] = (sm_df['y'] - sm_df['y'].mean()) / sm_df['y'].std()
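Putting that standardization together with the default C=1, as a self-contained sketch (the column-wise z-scoring uses pandas' sample standard deviation, `std()` with `ddof=1`):

```python
import pandas as pd
from sklearn import linear_model

sm_df = pd.DataFrame()
sm_df['x'] = [0.5, 4.0, 1.0, 2.5, 2.0, 3.5, 1.0, 3.0, 1.0, 2.0]
sm_df['y'] = [1.0, 3.5, 1.0, 3.5, 1.0, 4.5, 2.0, 3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]

# Standardize each feature: zero mean, unit variance
sm_df['x'] = (sm_df['x'] - sm_df['x'].mean()) / sm_df['x'].std()
sm_df['y'] = (sm_df['y'] - sm_df['y'].mean()) / sm_df['y'].std()

log = linear_model.LogisticRegression()  # default C=1
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("score after standardizing:", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))
```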