Question

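This may be a somewhat silly (and possibly trivial) question, but I am new to machine learning. That can easily be deduced from the code I present, and it is not meant as an excuse for a poorly formulated question. If you find the question poorly formulated, please let me know so that I can update it.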

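I trained a multiple linear regression model and I want to see how well it performs on a given dataset. So I searched around and found a nice article that explains how to find the "error" of the predicted values relative to the true ones. The options it gave me are the mean absolute error (MAE), the mean squared error (MSE), and the root mean squared error (RMSE).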

I applied all of them and they gave me very high values, so I do not know whether these are correct or how I should interpret them.

The output the article gets:

10.0
150.0
12.2474487139

The output my model gets:

7514.293659640891
83502864.03257468
9137.990152794797
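For reference, the three numbers in each block are related by simple formulas. The following is a minimal sketch using only NumPy; the four true/predicted values are hypothetical and were picked only because they reproduce the article's three outputs. The helper name `regression_errors` is likewise made up for illustration.

```python
import numpy as np

# Hypothetical values, chosen only so the results match the article's output
y_true = np.array([100.0, 50.0, 30.0, 20.0])
y_pred = np.array([90.0, 50.0, 50.0, 30.0])


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long 1-D arrays."""
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))  # mean absolute error
    mse = np.mean(residuals ** 2)     # mean squared error
    rmse = np.sqrt(mse)               # root mean squared error
    return mae, mse, rmse


mae, mse, rmse = regression_errors(y_true, y_pred)
print(mae, mse, rmse)

# The same data expressed in units that are 1000 times larger
mae_k, mse_k, rmse_k = regression_errors(y_true * 1000, y_pred * 1000)
print(mae_k, mse_k, rmse_k)
```

Rescaling the data by 1000 multiplies MAE and RMSE by 1000 and MSE by 1,000,000, so none of the three can be judged as "high" without comparing it to the typical size of the target values. The model's own output is at least internally consistent: the square root of 83502864.03 is about 9137.99.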

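For quick reference, these are my true/predicted values

'TLDR' question: Am I measuring the error correctly with the methods above, and do these results mean that my model performs very badly? (It does not seem that way when I compare the predictions with the true values.)

You can view the dataset I am using here.

The code I use to create the model and predict the values (I tried to remove the code that is not needed):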

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn import metrics

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values # Independent variables
y = dataset.iloc[:, 4].values # Dependent variable

# Encode the categorical State column (index 3) as one-hot dummy variables.
# The categorical_features argument of OneHotEncoder was removed from
# scikit-learn, so a ColumnTransformer selects the column instead.
# LabelEncoder is no longer needed: OneHotEncoder accepts strings directly.
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer(
    [('state', OneHotEncoder(), [3])],
    remainder='passthrough',  # keep the numeric columns unchanged
    sparse_threshold=0)       # always return a dense array
# The dummy columns come first, followed by the remaining columns
X = column_transformer.fit_transform(X).astype(float)

# Avoiding the dummy variable trap:
# drop the first dummy column so that only n - 1 dummies remain
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
# In this case we are going to use 40 of the 50 records for training
# and ten of the 50 for testing, hence the 0.2 split ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Create a regressor
regressor = LinearRegression()
# Fit the model to the training data
regressor.fit(X_train, y_train)

# Make predictions on the test set, using our model
y_pred = regressor.predict(X_test)

# Evaluating the model (Am I doing this correctly?)

# How well did it do?
print(metrics.mean_absolute_error(y_test, y_pred))
print(metrics.mean_squared_error(y_test, y_pred))
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Answer 1

To answer your question: I think you are measuring it correctly (at least as far as the code goes). But:

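Who told you that the relationship is linear? You are trying to predict profit (right?). I would say that a linear regression will probably not work very well, so I am not surprised if you do not get good results.
To get an idea of how your predictions are doing, try plotting the predicted values against the actual ones and check how closely your points stay on a line.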
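A minimal sketch of such a predicted-versus-actual plot, assuming matplotlib is available. The values below are randomly generated placeholders, not the real data; in practice `y_test` and `y_pred` would come from the code in the question. The output file name is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: replace with the real y_test and y_pred from the question
rng = np.random.default_rng(0)
y_test = rng.uniform(50_000, 200_000, size=10)
y_pred = y_test + rng.normal(0, 9_000, size=10)

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, label="test samples")

# Perfect predictions would fall exactly on the diagonal y = x
lo = min(y_test.min(), y_pred.min())
hi = max(y_test.max(), y_pred.max())
ax.plot([lo, hi], [lo, hi], linestyle="--", color="gray",
        label="perfect prediction")

ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
ax.legend()
fig.savefig("predicted_vs_actual.png")
```

Points scattered tightly around the dashed diagonal indicate good predictions; a curved or fan-shaped pattern would suggest that a straight-line model is missing something.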

To summarize: the fact that you get large values does not mean that your code is wrong. Most probably the relationship is not linear.

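Side note: using the categorical variable may be a source of problems. Have you tried a linear regression without the state? What results do you get? Which variable is the most important one in the regression? You should check that. What is your R squared?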
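These checks could be sketched as follows with scikit-learn. The data here is synthetic and every number in it is invented, so the printed values say nothing about the real dataset; the helper name `holdout_r2` is hypothetical. Comparing coefficients fitted on standardized features is one common way to rank variables, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (all numbers are invented)
rng = np.random.default_rng(0)
n = 200
numeric = rng.uniform(0, 100_000, size=(n, 3))  # three spend-like columns
state = rng.integers(0, 3, size=n)              # three state codes
dummies = np.eye(3)[state][:, 1:]               # one-hot, first level dropped
y = 0.8 * numeric[:, 0] + 0.05 * numeric[:, 2] + rng.normal(0, 5_000, size=n)


def holdout_r2(X, y):
    """Fit on 80% of the rows and return R^2 on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_with_state = holdout_r2(np.hstack([dummies, numeric]), y)
r2_without_state = holdout_r2(numeric, y)
print(r2_with_state, r2_without_state)

# Coefficients fitted on standardized features are comparable across columns
scaled = StandardScaler().fit_transform(numeric)
coefs = LinearRegression().fit(scaled, y).coef_
ranking = np.argsort(-np.abs(coefs))  # column indices, most influential first
print(ranking)
```

An R squared close to 1 means the model explains most of the variance in the target, regardless of how large MAE or RMSE look in absolute terms. If dropping the state columns barely changes R squared, the categorical variable contributes little.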

I hope this helps, Umberto

Am I measuring the performance of my multiple linear regression model correctly?
