Need help understanding a problem with my Python linear regression model code (sklearn)

Date: 2020-09-06 17:01:23

Tags: python machine-learning scikit-learn

I am working through my first linear regression code by following the Tech With Tim video (https://www.youtube.com/watch?v=45ryDIPHdGg), but I have run into trouble. I am using the UCI student performance data from here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

My initial model code runs fine. I then iterate repeatedly to find the model with the best accuracy, which also works. Where it starts to go wrong is when I try to inject those best coefficients into a new model and then make two sets of predictions:

  1. with the first model (from before the optimization loop)

  2. with the optimized model

both against the same x_test1 dataset. To compare the two, I simply sum the absolute differences (the square root of each squared difference) between the predicted and actual y values. I also record the final accuracy of both models.
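
(For reference, the same comparison can also be written with sklearn's built-in metrics; the snippet below is only a sketch, with preds1 and preds2 standing in as placeholders for the two models' predictions on x_test1.)

from sklearn.metrics import mean_absolute_error, mean_squared_error

# preds1 / preds2 stand in for the two models' predictions on x_test1.
# mean_absolute_error averages the same sqrt((pred - y)**2) terms that the code below sums;
# mean_squared_error would be the squared-difference variant.
mae_first = mean_absolute_error(y_test1, preds1)
mae_best = mean_absolute_error(y_test1, preds2)
mse_first = mean_squared_error(y_test1, preds1)
mse_best = mean_squared_error(y_test1, preds2)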

I must be doing something wrong, because my new "optimized" model ends up with the same or lower accuracy than the first model, and the summed differences are also very similar. I expected the optimized model to have less error and higher accuracy.

Can someone help me find the mistake? I suspect it is somewhere after the plotting section of the code. Thanks in advance; the code is below.

# Import libraries
import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection  # explicit import so sklearn.model_selection.train_test_split resolves
import pickle
import matplotlib.pyplot as plt
from sklearn import linear_model
from math import sqrt
from sklearn.linear_model import LinearRegression
from matplotlib import style

# from sklearn.utils import shuffle

# Read in Data
data = pd.read_csv("student-mat.csv", sep=";")

# Slice data to include only desired headings
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

# Define the attribute we are trying to predict; called "label".
# Others are "features" and used to predict label
predict = "G3"

# Create array of features and label
X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])

# Split data into training and testing data.  90% used for training, 10% testing
# Test size 0.1 = 10% of array size
x_train1, x_test1, y_train1, y_test1 = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

# Create 1st linear model and fit
linear = linear_model.LinearRegression()
linear.fit(x_train1, y_train1)

# Compute accuracy of model
acc = linear.score(x_test1, y_test1)

# Iterate for a given number of times (max_iter) to find an optimal accuracy value and record best coefficients
loop_num = 1
max_iter = 1000
best_acc = acc
best_coef = linear.coef_
best_int = linear.intercept_
acc_counter = [acc]

print("\nInitial Accuracy: %4.3f" % acc)

while loop_num < max_iter + 1:
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear2 = linear_model.LinearRegression()
    linear2.fit(x_train, y_train)
    acc = linear2.score(x_test, y_test)
    acc_counter.append(acc)
    print("\nAccuracy of run " + str(loop_num) + " is: %4.3f" % acc)
    if acc > best_acc:
        print("\n\tBetter accuracy found.")
        best_acc = acc
        best_coef = linear2.coef_
        best_int = linear2.intercept_
        print("Co: \n", linear2.coef_)
        print("Intercept: \n", linear2.intercept_)
    else:
        print("\n\tFit Discarded.")
    loop_num += 1

print("\nBest Acccuracy: \n%4.3f" % best_acc)
print("\nBest Coefficients: \n", best_coef)
print("\nBest Intercept: \n", best_int)

# Plot Accuracy over time
x_scale = list(range(len(acc_counter)))  # one point per recorded accuracy (initial fit + each loop run)

plt.plot(x_scale, acc_counter, color='green', linestyle='dashed', linewidth=3, marker='o',
         markerfacecolor='blue', markersize=5)

ymax = max(acc_counter)
ymin = min(acc_counter)
xpos = acc_counter.index(ymax)
xmax = x_scale[xpos]
annot_max_acc = str(ymax)
plt.annotate('Max Accuracy = ' + annot_max_acc[0:4], xy=(xmax, ymax), xycoords='data', xytext=(.8, .95),
             textcoords='axes fraction',
             arrowprops=dict(facecolor='black', shrink=0.05), horizontalalignment='right', verticalalignment='top')
plt.ylim(ymin, 1.0)
plt.xlabel('Run Number')
plt.ylabel('Accuracy')
plt.title('Prediction Accuracy over Time')
plt.show()

# Create model with best coefficients from above
new_model = linear_model.LinearRegression()
new_model.intercept_ = best_int
new_model.coef_ = best_coef

# Predict y values for 1st model (not best) then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp = []
predictions = linear.predict(x_test1)
for x in range(len(predictions)):
    print(predictions[x], x_test1[x], y_test1[x])
    diff = sqrt((predictions[x] - y_test1[x])**2)
    print("\tDifference is ", diff)
    comp.append(diff)
print("\n\n\nBREAK")
print(comp)
print("\nSum of errors is ", sum(comp))

# Predict y values of best model (with optimal coefficients from above) using same x_test1 values as 1st model
# then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp2 = []
predictions_new_model = new_model.predict(x_test1)
for x in range(len(predictions_new_model)):
    print(predictions_new_model[x], x_test1[x], y_test1[x])
    diff2 = sqrt((predictions_new_model[x] - y_test1[x])**2)
    print("\tDifference is ", diff2)
    comp2.append(diff2)

print("\n\n\nBREAK")
print(comp2)
print("\nSum of errors is ", sum(comp2))

print("\n\n\nFirst model fit difference: ", sum(comp))
print("\nSecond model fit difference ", sum(comp2))

print('\n1st model score: ',linear.score(x_train1, y_train1))

print('\nBest model score: ',new_model.score(x_train1, y_train1))

1 Answer:

Answer 0 (score: -1)

Looking at your code, I just realized that you are using the same model (LinearRegression) in every run and never changing any hyperparameters, so there is really nothing for the loop to improve; the only difference between runs is that you split the data again each time (without giving it a random seed), so the train/test splits are slightly different and the scores vary by chance. To actually improve the model you have to change the estimator's hyperparameters. See more here: hyperparameter tuning
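
As a minimal sketch of what this suggests (assuming the same student data and column slice as in the question; the Ridge estimator and the alpha grid are illustrative choices, not something the answer prescribes), you could fix the split with random_state so both models are scored on identical data, and tune a regularized model with GridSearchCV:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge

# Same data preparation as in the question
data = pd.read_csv("student-mat.csv", sep=";")
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
X = np.array(data.drop(["G3"], axis=1))
y = np.array(data["G3"])

# A fixed random_state makes the split reproducible, so score differences
# reflect the model rather than a luckier split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Hyperparameter tuning: cross-validated search over Ridge's regularization strength
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}  # illustrative grid
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(x_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Test R^2:", search.best_estimator_.score(x_test, y_test))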