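This may be a somewhat silly (and possibly trivial) question, but I am new to machine learning. That is probably easy to infer from the code I am presenting, and it is not meant as an excuse for a poorly formulated question. If you find the question poorly formulated, please let me know so I can update it.
I trained a multiple linear regression model and I want to see how well it performs on a given dataset. So I searched around and found a nice article explaining how to measure the "error" of the predicted values against the true ones. The options it gives me are:
I applied all of them, and they gave me very high values, so I don't know whether they are correct or how I should interpret them.
The output the article gets:
The output my model gets:
TL;DR question: am I measuring the error correctly with the methods above, and do these results mean that my model performs very badly? (That does not seem to be the case when I compare the predictions with the true values.)
Here you can view the dataset I am using.
The code I use to create the model and predict values (I tried to remove the code that is not needed):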
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn import metrics
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values # Independent variables
y = dataset.iloc[:, 4].values # Dependent variable
# Encode categorical data into numerical values (1, 2, 3)
# For example; New york becomes 1 and Florida becomes 2
labelencoder_states = LabelEncoder()
# We just want to apply this to the state column, since this has categorical data
states_encoded = labelencoder_states.fit_transform(X[:, 3])
# Update the states with the new encoded data
X[:, 3] = states_encoded
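# Now that we have the categories as numerical data,
# we can split them into multiple dummy variables.
# OneHotEncoder(categorical_features=[3]) was removed from scikit-learn,
# so ColumnTransformer is used to point the encoder at the state column
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer(
    [('state', OneHotEncoder(), [3])],
    remainder='passthrough',  # keep the other columns unchanged
    sparse_threshold=0)       # always return a dense array
# Actually transforms them into columns (the dummy columns come first)
X = column_transformer.fit_transform(X).astype(float)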
# Avoid the dummy variable trap: k categories need only k - 1 dummy columns,
# so remove the first column from X
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
# In this case we are going to use 40 of the 50 records for training
# and ten of the 50 for testing, hence the 0.2 split ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Create a regressor
regressor = LinearRegression()
# Fit the model to the training data
regressor.fit(X_train, y_train)
# Make predictions on the test set, using our model
y_pred = regressor.predict(X_test)
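# Evaluating the model (Am I doing this correctly?)
# How well did it do?
# Mean absolute error (same units as y)
print(metrics.mean_absolute_error(y_test, y_pred))
# Mean squared error (squared units of y)
print(metrics.mean_squared_error(y_test, y_pred))
# Root mean squared error (same units as y)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))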
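Answer 0 (score: 1)
Let me try to answer: I think you are measuring the error correctly (at least as far as the code goes). But:
Who told you the relationship is linear? You are trying to predict profit (right?). I would say linear regression may not work very well here. So if you are not getting good results, I am not surprised.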
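To get an idea of how your predictions are doing, try plotting the predicted values against the actual ones and check how closely your points stay on a line.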
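A minimal sketch of such a plot, assuming matplotlib is available. The helper `plot_predicted_vs_actual` and the numbers are made up for illustration; in the question's code you would pass `y_test` and `y_pred` instead. Points close to the dashed y = x line are good predictions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np


def plot_predicted_vs_actual(y_true, y_pred):
    """Scatter predictions against true values, with the y = x reference line."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    fig, ax = plt.subplots()
    ax.scatter(y_true, y_pred)
    # Reference line: a perfect model would put every point on it
    low = min(y_true.min(), y_pred.min())
    high = max(y_true.max(), y_pred.max())
    ax.plot([low, high], [low, high], linestyle="--", color="gray")
    ax.set_xlabel("Actual profit")
    ax.set_ylabel("Predicted profit")
    return fig, ax


# Made-up numbers standing in for y_test and y_pred from the question
y_test_demo = [103000.0, 132000.0, 146000.0, 78000.0, 191000.0]
y_pred_demo = [104500.0, 129800.0, 150200.0, 81300.0, 186400.0]
fig, ax = plot_predicted_vs_actual(y_test_demo, y_pred_demo)
fig.savefig("predicted_vs_actual.png")
```

A curved or fan-shaped cloud around the line would hint that a straight-line model is missing something.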
To sum up: the fact that you get large values does not mean your code is wrong. Most likely the relationship is simply not linear.
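One more reason large values are not alarming by themselves: MAE and RMSE are expressed in the units of the target, and MSE in squared units, so with profits in the hundreds of thousands an error of a few thousand can still be small in relative terms. A rough sketch with made-up numbers (not taken from your dataset), relating each error to the mean of the target:

```python
import numpy as np

# Made-up profits, standing in for y_test and y_pred from the question
y_true = np.array([100000.0, 120000.0, 150000.0, 90000.0])
y_pred = np.array([104000.0, 114000.0, 155000.0, 95000.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # same units as the target
mse = np.mean(errors ** 2)      # squared units, hence the huge number
rmse = np.sqrt(mse)             # back in the units of the target

# Put the errors in proportion to the typical size of the target
mean_profit = y_true.mean()
print(f"MAE  = {mae:.0f} ({mae / mean_profit:.1%} of the mean profit)")
print(f"MSE  = {mse:.0f}")
print(f"RMSE = {rmse:.0f} ({rmse / mean_profit:.1%} of the mean profit)")
```

The ratio to the mean is only one possible yardstick; comparing against the standard deviation of the target is another common choice.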
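Side note: the use of the categorical variable may be a source of problems. Have you tried the linear regression without the state? What results do you get? Which variable is the most important in the regression? You should check that. What is your R squared?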
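A sketch of those checks, on synthetic data rather than your file: the columns, coefficients, and noise level below are assumptions chosen only to make the example runnable. It compares R squared with and without the state dummies, and uses coefficients on standardized features as a rough importance ranking.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 50

# Synthetic stand-in for the dataset: three spend columns and a state code
spend = rng.uniform(0, 100000, size=(n, 3))
state = rng.integers(0, 3, size=n)
# Two dummy columns for three states (the first state is the baseline)
state_dummies = np.eye(3)[state][:, 1:]
# In this made-up data, profit depends on the first spend column only
profit = 50000 + 0.8 * spend[:, 0] + rng.normal(0, 5000, size=n)

X_with_state = np.hstack([state_dummies, spend])
X_without_state = spend

model_with = LinearRegression().fit(X_with_state, profit)
model_without = LinearRegression().fit(X_without_state, profit)

# score() returns R squared, here computed on the training data
r2_with = model_with.score(X_with_state, profit)
r2_without = model_without.score(X_without_state, profit)
print("R^2 with state:   ", round(r2_with, 3))
print("R^2 without state:", round(r2_without, 3))

# Coefficients on standardized features give a rough importance ranking
spend_std = (spend - spend.mean(axis=0)) / spend.std(axis=0)
model_std = LinearRegression().fit(spend_std, profit)
most_important = int(np.argmax(np.abs(model_std.coef_)))
print("Most influential spend column:", most_important)
```

If dropping the state barely changes R squared, as happens by construction here, the categorical variable is contributing little. On the real data you would want R squared on the test set as well, for example with `regressor.score(X_test, y_test)`.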
I hope this helps, Umberto