So I have a small dataset, and I want to perform multiple linear regression on it.
First, I drop the delivery column because it is highly correlated with miles. Although gasprice should probably also be dropped, I keep it so that I can perform multiple linear regression rather than simple linear regression. Finally, I remove the outliers.
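The column drop and outlier removal roughly look like this (a sketch only; the exact column name and the z-score cutoff of 3 may differ from what I actually used):
import numpy as np
from scipy import stats
# drop the column that is highly correlated with miles
dfafter = df.drop(columns=['delivery'])
# keep only rows whose z-score is below 3 on every numeric column
numeric = dfafter.select_dtypes('number')
dfafter = dfafter[(np.abs(stats.zscore(numeric)) < 3).all(axis=1)]
With dfafter prepared, the rest of the code is: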
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# split X and Y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficient = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create an OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
That is where the code ends. Every time I print the coefficients, I get different values. What am I doing wrong, and is any of them correct?
Answer (score: 0):
I see you are trying three different approaches here, so let me summarize:

1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only 80% of the data is used for fitting (but since you fixed the random state, the split should be the same on every run).
2. statsmodels.api.OLS with the full dataset (you pass X2 and Y, which have not been train-test split).
3. sklearn.linear_model.LinearRegression() with the full dataset, as in no. 2.

I tried to reproduce this with the iris dataset, and I get identical results for cases 2 and 3 (which are trained on exactly the same data), while case 1 has slightly different coefficients.
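A minimal sketch of such a reproduction (assuming scikit-learn's built-in load_iris with as_frame=True, which needs a reasonably recent scikit-learn; predicting petal width from the other three columns is an arbitrary choice):
import statsmodels.api as sm
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
iris = load_iris(as_frame=True)
X_full = iris.data[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]
y_full = iris.data['petal width (cm)']
# case 1: sklearn on an 80% training split
X_tr, X_te, y_tr, y_te = train_test_split(X_full, y_full, test_size=0.2, random_state=1)
case1 = LinearRegression().fit(X_tr, y_tr)
# case 2: statsmodels OLS on the full data
case2 = sm.OLS(y_full, sm.add_constant(X_full)).fit()
# case 3: sklearn on the full data
case3 = LinearRegression().fit(X_full, y_full)
print(case1.coef_)          # fit on 80% of the rows, so slightly different
print(case2.params.values)  # intercept followed by the same coefficients as case 3
print(case3.coef_)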
In order to assess whether any of them is "correct", you will need to evaluate the model on unseen data and look at the adjusted R^2 score, etc. (hence the need for a hold-out (test) set). If you want to improve the model further, you can try to better understand the interactions between the features in your linear model. Statsmodels has a neat "R-like" formula syntax for specifying your model: https://www.statsmodels.org/dev/example_formulas.html
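For example, with the columns from the question (hours, miles, gasprice), a model with an interaction term could be written like this (the interaction term is only illustrative):
import statsmodels.formula.api as smf
# 'hours ~ miles * gasprice' expands to miles + gasprice + miles:gasprice
formula_model = smf.ols('hours ~ miles * gasprice', data=df).fit()
print(formula_model.summary())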