Question

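I am new to machine learning and Python. I am trying to build a random forest regression model on one of the datasets in the UCI repository. This is my first ML model, so my approach may be entirely wrong.

The dataset is available here: https://archive.ics.uci.edu/ml/datasets/abalone

Below is the complete working code I wrote. I am using Python 3.6.4 on Windows 7 x64 (please excuse the length of the code).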

import tkinter as tk # Required for enabling GUI options
from tkinter import messagebox # Required for pop-up window
from tkinter import filedialog # Required for getting full path of file
import pandas as pd # Required for data handling
from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor # Required to build random forest

#------------------------------------------------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window

root = tk.Tk() # Create an instance of tkinter
root.withdraw() # Hides root window
#root.lift() # Required for pop-up window management
root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows

#------------------------------------------------------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options

print("Reading input file...")

# Pop up window to ask user the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")

# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0

while (file_loop == 0):
    # Get path of base file
    file_path =  filedialog.askopenfilename(initialdir = "/",
                                            title = "File Selection Prompt",
                                            filetypes = (("CSV Files", "*.csv"), ("All Files", "*.*")))

    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")

        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Get file name
    file_name = file_path.split("/") # Splits the file with "/" as the delimiter and returns a list
    file_name = file_name[-1] # extracts the last element of the list

    # Condition to check if correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")

        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)

    break

# Delete variables that are no longer needed
del(file_loop, file_name)

#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")

# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])

# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)

#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")

# Create a new dataframe to handle categorical data
# This method splits the categorical data column into separate indicator columns
# drop_first = True drops one level so that we avoid the dummy variable trap
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)

# Remove the specific columns from the dataframe
# These are the categorical data columns that were split into separate columns in the previous step
X.drop(columns = ['Sex'], inplace = True)

# Merge the new columns into the original dataframe
X = pd.concat([X, dummy_Sex], axis = 1)

#------------------------------------------------------------------------------------------------------------------------#
y = y.values 
X = X.values

#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")

# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")

# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message

#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")

# Predicting a new result with regression
y_pred = regressor.predict(X_test)

# Enter values for new prediction as a Dictionary
test_values = {'Sex_I' : 0,
               'Sex_M' : 0,
               'Length' : 0.5,
               'Diameter' : 0.35,
               'Height' : 0.8,
               'Whole_Weight' : 0.223,
               'Shucked_Weight' : 0.09,
               'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}

# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])

# Rearranging columns to match the column order of X used for training
test_values = test_values[['Length','Diameter','Height','Whole_Weight','Shucked_Weight','Viscera_Weight',
                           'Shell_Weight', 'Sex_I', 'Sex_M']]

# Applying feature scaling
#test_values = sc_X.transform(test_values)

# Predicting values of new data
new_pred = regressor.predict(test_values)

#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")

# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")

# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
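# Note: for a regressor, score() returns the coefficient of determination (R^2), not a classification accuracy
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))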

print("\n")
print("Printing predicted result...")
print("Predicted Rings = ", new_pred)

When I look at the model's accuracy, this is what I get.

Getting Model Accuracy...
Training Accuracy =  0.9359702279804791
Test Accuracy =  0.5695080680053354

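Here are my questions.

1) Why are the training accuracy and test accuracy so far apart?

2) How do I know whether this model is overfitting?

3) Is random forest regression the right model to use? If not, how do I determine the right model for this use case?

4) How can I build a confusion matrix from the variables I created?

5) How do I validate the model's performance?

I am looking for your guidance so that I can learn from my mistakes and improve my modeling skills.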

Answer 1

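Before trying to answer your points, one remark: I see that you are using accuracy as the metric for a regressor. Accuracy is a metric for classification problems; for regression models you would normally use other metrics, such as mean squared error (MSE).

If you simply switch to a more suitable metric, you may find that your model is not that bad.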
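As a rough illustration of the metrics mentioned above, the sketch below computes MSE, RMSE, MAE and R² for a random forest regressor. The data is synthetic (an assumption, standing in for the abalone features and ring counts), and the hyperparameter values are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the abalone data (assumption, not the real dataset).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = RandomForestRegressor(n_estimators=50, random_state=50)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Regression metrics: errors are in (squared) target units, R^2 is unitless.
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"MAE  = {mae:.3f}")
print(f"R^2  = {r2:.3f}")
```

The R² value printed here is the same quantity that `regressor.score(X_test, y_test)` returns, which is why the numbers in the question are not accuracies.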

I will answer your questions anyway.

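Why are the training accuracy and test accuracy so far apart?
It means that you have overfit the training sample: your model is very good at predicting the data in the training set but does not generalize. It is like a model trained on a set of cat pictures that believes only those pictures are cats and that pictures of any other cat are not. Note, however, that score() on a regressor returns R², not accuracy, so a test value of about 0.57 is not the same as a random guess; it means the model explains a little more than half of the variance in the test targets.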

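How do I know whether this model is overfitting?
Precisely from the difference in score between the two sets. The closer the two scores are, the better the model generalizes. You already know what overfitting looks like. Underfitting can usually be recognized by low scores on both sets.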
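A minimal sketch of that check, assuming synthetic noisy data in place of the abalone set: fit an unrestricted forest and look at the gap between the training and test R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (assumption, not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 5))
y = 10.0 * X[:, 0] + 5.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 2.0, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# No depth limit: every tree is free to grow until its leaves are nearly pure.
regressor = RandomForestRegressor(n_estimators=100, random_state=50)
regressor.fit(X_train, y_train)

train_r2 = regressor.score(X_train, y_train)
test_r2 = regressor.score(X_test, y_test)
gap = train_r2 - test_r2

print(f"Train R^2 = {train_r2:.3f}")
print(f"Test  R^2 = {test_r2:.3f}")
print(f"Gap       = {gap:.3f}")
```

A large positive gap points to overfitting, while two low scores that sit close together point to underfitting.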

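Is random forest regression the right model to use? If not, how do I determine the right model for this use case?
There is no single right model. Random forests, and tree-based models in general (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because they are simple and reliable. Deep-learning-based models can in theory perform better, but they are much more complicated to set up.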
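One common way to choose between candidate models is to compare them with cross-validation on the same data. The sketch below is only an illustration: the three candidates, the synthetic data and the use of RMSE as the comparison metric are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic noisy regression problem (assumption, not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 5))
y = 10.0 * X[:, 0] + 5.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 2.0, size=400)

# Hypothetical shortlist of candidate models.
candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=50),
    "gradient_boosting": GradientBoostingRegressor(random_state=50),
}

cv_rmse = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is exposed as a negative value.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-neg_mse.mean()))

for name, rmse in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name:>18}: cross-validated RMSE = {rmse:.3f}")
```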

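How can I build a confusion matrix from the variables I created?
You build a confusion matrix for a classification model, from the classes the model predicts. You are using a regressor, so it does not make much sense here.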

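How do I validate the model's performance?
In general, to validate performance reliably you split the data into three parts: you train on the first (the training set), you tune the model on the second (the validation set, which is what you call the test set), and finally, once you are happy with the model and its hyperparameters, you test it on the third (the test set, not to be confused with what you call the test set). This last score tells you whether the model generalizes well. The reason is that while you select and tune a model you can also overfit the validation set, for example by picking a set of hyperparameters that only performs well on that set. You also have to choose a reliable metric, which depends on the data and the model. For regression, MSE is a good choice.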
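The three-way split described above could look roughly like this. The 60/20/20 proportions, the candidate `max_depth` values and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (assumption, not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 5))
y = 10.0 * X[:, 0] + 5.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 2.0, size=500)

# First hold out 20% as the final test set, then split the rest 75/25
# so that the overall proportions are 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Tune one hyperparameter using the validation set only.
best_depth, best_val_mse = None, float("inf")
for depth in (2, 4, 8, None):
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=50)
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_depth, best_val_mse = depth, val_mse

# Touch the test set once, after the choice has been made.
final_model = RandomForestRegressor(n_estimators=50, max_depth=best_depth, random_state=50)
final_model.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))

print(f"Best max_depth = {best_depth}, validation MSE = {best_val_mse:.3f}")
print(f"Test MSE = {test_mse:.3f}")
```

With small datasets, k-fold cross-validation on the training portion is a common alternative to a single fixed validation set.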

Answer 2

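With trees and ensembles, there are a few settings you have to take care of. In your case the difference comes from overfitting: your model has learned your training data too closely and cannot generalize to other data.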

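One important thing to do is to limit the depth of the trees. Each tree has a branching factor of 2, which means that at depth d it can have up to 2^d leaves.

Suppose you have 1000 training values. If you do not limit the depth (and/or min_samples_leaf), a tree of depth 10 can memorize the complete dataset, because 2^10 = 1024 > N_training.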
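The arithmetic can be checked in a couple of lines. The helper name `depth_to_memorize` is hypothetical; it simply returns the smallest depth d with 2^d >= n.

```python
import math

def depth_to_memorize(n_samples: int) -> int:
    """Smallest depth d such that a binary tree has at least n_samples leaves."""
    return math.ceil(math.log2(n_samples))

n_training = 1000
min_depth = depth_to_memorize(n_training)

print(f"Depth needed for {n_training} samples: {min_depth}")
print(f"Leaves available at that depth: {2 ** min_depth}")
```

This is a lower bound for a perfectly balanced tree; real trees are usually unbalanced, so an unrestricted tree may grow deeper than this.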

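What you can do is compare the training and test scores over a range of depths (from 3 up to log2(n)). If the depth is too low, both scores will be low, because the trees need more branches to learn the data properly. As the depth increases, both scores rise to a peak; beyond it the training score keeps rising while the test score drops. The result should look like the usual model-complexity curve, with depth as the measure of complexity.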
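A sketch of such a depth sweep, under the assumption of synthetic data and a small forest so that it runs quickly:

```python
import math

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (assumption, not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(600, 5))
y = 10.0 * X[:, 0] + 5.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 2.0, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Sweep max_depth from 3 up to ceil(log2(number of training samples)).
max_useful_depth = math.ceil(math.log2(len(X_train)))
depths = list(range(3, max_useful_depth + 1))

train_scores, test_scores = [], []
for depth in depths:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=50)
    model.fit(X_train, y_train)
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

for depth, train_r2, test_r2 in zip(depths, train_scores, test_scores):
    print(f"max_depth={depth:2d}  train R^2={train_r2:.3f}  test R^2={test_r2:.3f}")
```

Plotting `train_scores` and `test_scores` against `depths` gives the complexity curve; a reasonable depth is near the point where the test score stops improving.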

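You can also play with min_samples_split and/or min_samples_leaf, which allow a split only when the branch holds more than a given number of samples. As a result, this limits the depth and allows trees whose branches have different depths. As explained above, you can try a range of values to find the best one (using a grid search).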
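A grid search over these two parameters might look like the sketch below. The grid values, the 3-fold cross-validation and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic noisy regression problem (assumption, not the abalone data).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 5))
y = 10.0 * X[:, 0] + 5.0 * np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid; larger values force shallower, more regularized trees.
param_grid = {
    "min_samples_leaf": [1, 5, 20],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=50),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# With the default refit=True, predict() uses the best estimator refit on all of X_train.
test_mse = mean_squared_error(y_test, search.predict(X_test))

print("Best parameters:", search.best_params_)
print(f"Cross-validated MSE of best setting = {-search.best_score_:.3f}")
print(f"Test MSE = {test_mse:.3f}")
```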

I hope this helps.

Random forest regression: score differs between training set and test set
