Question

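I am measuring the MSE on the training set, so I expected the MSE to decrease as I use higher polynomial degrees. However, going from degree 4 to degree 5, the MSE increases significantly. What could be the reason?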

import pandas as pd, numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

path = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv"
df = pd.read_csv(path)
r=[]
max_degrees = 10

y = df['price'].astype('float')
x = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']].astype('float')

for i in range(1, max_degrees + 1):
    # Scale the inputs, expand them to polynomial features of degree i, then fit ordinary least squares
    Input = [('scale', StandardScaler()), ('polynomial', PolynomialFeatures(degree=i)), ('model', LinearRegression())]
    pipe = Pipeline(Input)
    pipe.fit(x, y)
    yhat = pipe.predict(x)
    # Training-set MSE; scikit-learn expects y_true first, then y_pred
    mse = mean_squared_error(y, yhat)
    r.append(mse)
    print("MSE for MLR of degree " + str(i) + " = " + str(round(mse / 1e6, 1)))

plt.figure(figsize=(10,3))
plt.plot(list(range(1,max_degrees+1)),r)
plt.show()

Result:

Answer 1

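Initially, you have about 200 observations in y and 4 features (columns) in X, which you then scale and expand into polynomial features.

With 4 input columns, PolynomialFeatures(degree=d) produces C(4+d, d) columns, bias term included: 70 at degree 4, 126 at degree 5, and 210 at degree 6. Degree 6 is therefore the first degree with more features than observations, and at degree 5 the feature count is already well over half the number of observations.

If there are more features than observations, linear regression is underdetermined and should not be used, as described here. Even before that point, a design matrix made of many strongly correlated polynomial columns can be close to rank-deficient numerically, which may explain why the training fit suddenly gets worse when going from degree 4 to degree 5. For higher degrees, the LR solver still seems able to fit the training data.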
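The feature counts can be checked with a short sketch. It assumes synthetic random inputs with the same shape as the question's data (200 rows, 4 columns) instead of the real CSV, and the names `summary` and `design` are illustrative only. For each degree it compares the number of columns produced by `PolynomialFeatures` with the closed-form count C(4+d, d), and reports the numerical rank of the expanded matrix.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumption: synthetic stand-in data shaped like the question's inputs
# (200 rows, 4 columns); the real automobile CSV is not downloaded here.
rng = np.random.default_rng(0)
n_samples, n_inputs = 200, 4
X = StandardScaler().fit_transform(rng.normal(size=(n_samples, n_inputs)))

summary = []  # (degree, number of polynomial features, numerical rank)
for degree in range(1, 8):
    design = PolynomialFeatures(degree=degree).fit_transform(X)
    n_features = design.shape[1]
    # Number of monomials of total degree <= degree, bias column included.
    expected = comb(n_inputs + degree, degree)
    # The rank can never exceed min(rows, columns), so once the columns
    # outnumber the rows the coefficients are no longer uniquely determined.
    rank = int(np.linalg.matrix_rank(design))
    summary.append((degree, n_features, rank))
    print(f"degree {degree}: {n_features:4d} features "
          f"(formula: {expected}), rank {rank:3d}, "
          f"more features than rows: {n_features > n_samples}")
```

The loop reports 70, 126 and 210 columns for degrees 4, 5 and 6, so with 200 rows the system is underdetermined from degree 6 onward. The rank values on the real data may differ from those on this synthetic stand-in, since real features such as horsepower and engine size are strongly correlated.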

Does multivariate linear regression become more accurate with a higher polynomial degree?
