Am I measuring the performance of my multiple linear regression model correctly?

Time: 2018-02-25 23:38:29

Tags: python machine-learning regression

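This may be a somewhat silly (perhaps trivial) question, but I am new to machine learning. That is easy to infer from the code I present, and it is not meant as an excuse for a poorly formulated question. If you find the question poorly formulated, please let me know so that I can update it.

I trained a multiple linear regression model and I want to see how well it performs on a given dataset. I searched around and found a nice article explaining how to measure the "error" of the predicted values relative to the true ones. The options it gave me were: [image]

I applied all of them and they gave me very high values, so I don't know whether they are correct or how I should interpret them.

The output the article gets: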

  • 10.0
  • 150.0
  • 12.2474487139

The output my model gets:

  • 7514.293659640891
  • 83502864.03257468
  • 9137.990152794797
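
A small consistency check on both sets of numbers, assuming the three values in each list are the mean absolute error, the mean squared error, and the root mean squared error, in the order the code further down prints them. The sketch uses only the numbers quoted above; the variable names are illustrative.

```python
import math

# (MAE, MSE, RMSE) as quoted above; this ordering is an assumption
article = (10.0, 150.0, 12.2474487139)
question = (7514.293659640891, 83502864.03257468, 9137.990152794797)

for label, (mae, mse, rmse) in [("article", article), ("question", question)]:
    # RMSE is defined as the square root of MSE, and RMSE can never be below MAE
    sqrt_matches = math.isclose(math.sqrt(mse), rmse, rel_tol=1e-9)
    print(label, "| sqrt(MSE) matches RMSE:", sqrt_matches, "| RMSE >= MAE:", rmse >= mae)
```

MAE and RMSE are expressed in the units of the target, and MSE in squared units, so an MSE in the tens of millions goes with an RMSE in the thousands. Whether that is large depends on the typical size of the target values.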

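For quick reference, these are my true/predicted values: [image]

TL;DR question: am I measuring the error correctly with the methods above, and do these results mean my model performs very badly? (It doesn't seem that way when I compare the predictions with the true values.)

Here you can view the dataset I am using.

The code I used to create the model and predict the values (I tried to remove the code that isn't needed):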

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn import metrics

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values # Independent variables
y = dataset.iloc[:, 4].values # Dependent variable

# Encode the categorical state column (index 3) as one-hot dummy columns.
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22,
# so ColumnTransformer selects the column instead; no LabelEncoder step is needed.
from sklearn.compose import ColumnTransformer
state_encoder = ColumnTransformer(
    [('state', OneHotEncoder(), [3])],
    remainder = 'passthrough',  # keep the numeric columns unchanged
    sparse_threshold = 0)       # always return a dense array
# The dummy columns come first, followed by the remaining columns
X = state_encoder.fit_transform(X).astype(float)

# Avoiding the dummy variable trap:
# drop the first dummy column, since k categories need only k - 1 dummy variables
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
# In this case we are going to use 40 of the 50 records for training
# and ten of the 50 for testing, hence the 0.2 split ratio
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in 0.20)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create a regressor
regressor = LinearRegression()
# Fit the model to the training data
regressor.fit(X_train, y_train)

# Make predictions on the test set, using our model
y_pred = regressor.predict(X_test)

# Evaluating the model (Am I doing this correctly?)

# How well did it do?
print(metrics.mean_absolute_error(y_test, y_pred))
print(metrics.mean_squared_error(y_test, y_pred))
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

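1 Answer:

Answer 0 (score: 1)

Let's answer this: I think you are measuring it correctly (at least as far as the code goes). But:

  1. Who told you the relationship is linear? You are trying to predict profit (right?). I would say linear regression may not work very well here, so I am not surprised if you don't get good results.

  2. To get a feel for how your predictions are doing, try plotting predicted versus actual values and check how closely your points stay on a line.

  3. To sum up: the fact that you get large values does not mean your code is wrong. Most likely the relationship is not linear.

    Side note: using the categorical variable may be a source of problems. Have you tried the linear regression without the state? What results do you get? Which variable matters most in the regression? You should check that. What is your R squared?

    I hope this helps, Umberto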
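
A minimal sketch of the checks suggested above: R squared, the error relative to the size of the target, a predicted-versus-actual plot, and a fit with and without the categorical columns. It runs on synthetic stand-in data, not the dataset from the question; the helper names `evaluate` and `plot_predicted_vs_actual` and every number in the data generator are assumptions made for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate(X, y, seed=0):
    """Fit a linear regression and return metrics on a held-out test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    return {
        "mae": mae,
        "rmse": np.sqrt(mean_squared_error(y_test, y_pred)),
        "r2": r2_score(y_test, y_pred),
        # MAE as a fraction of the typical target size (unit-free)
        "mae_over_mean_target": mae / np.mean(np.abs(y_test)),
        "y_test": y_test,
        "y_pred": y_pred,
    }


def plot_predicted_vs_actual(y_test, y_pred):
    """Scatter predictions against true values, with the y = x reference line."""
    fig, ax = plt.subplots()
    ax.scatter(y_test, y_pred)
    lo = min(y_test.min(), y_pred.min())
    hi = max(y_test.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], linestyle="--")
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    return fig


# Synthetic stand-in data (NOT the dataset from the question): three numeric
# features, one 3-level category as two dummy columns, target around 100,000
rng = np.random.default_rng(0)
n = 50
numeric = rng.uniform(0, 100_000, size=(n, 3))
category = rng.integers(0, 3, size=n)
dummies = np.eye(3)[category][:, 1:]  # drop the first level
y = 50_000 + 0.8 * numeric[:, 0] + 0.05 * numeric[:, 2] + rng.normal(0, 9_000, size=n)

with_category = evaluate(np.hstack([dummies, numeric]), y)
without_category = evaluate(numeric, y)
for name, res in [("with category", with_category), ("without category", without_category)]:
    scalars = {k: round(float(v), 4) for k, v in res.items() if k not in ("y_test", "y_pred")}
    print(name, scalars)

fig = plot_predicted_vs_actual(with_category["y_test"], with_category["y_pred"])
```

An MAE in the thousands can still be a small fraction of the target when the target itself is in the hundreds of thousands, which is why the unit-free numbers (`r2` and `mae_over_mean_target`) are easier to judge than the raw errors. Call `fig.savefig(...)` or switch to an interactive backend to look at the plot.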