Linear regression with a greedy feature selection algorithm in Python

Asked: 2018-11-04 09:08:51

Tags: python machine-learning linear-regression feature-selection

This is a homework problem from a machine-learning course I'm taking. I'll describe the approach I took, what worked, and what didn't, as best I can.


We are given four data files: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We start by loading the datasets as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("./test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)

The problem we need to solve is to implement a greedy feature selection algorithm that keeps going until the best 100 of the 126 features have been chosen. Basically, we train a model on each single feature and store the best one, then train 125 models pairing each remaining feature with the selected one and store the next best one, and continue this way until we reach 100 features.
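
To make the intended procedure concrete, here is a rough sketch of the greedy forward-selection loop on its own (this is only my restatement of the idea, not the homework template; cv_score is a hypothetical helper that would return the mean cross-validation MSE of a linear regression fitted on the given columns):

selected = []
remaining = list(range(X_dev.shape[1]))   # all 126 candidate feature indices
for _ in range(100):
    # evaluate adding each remaining feature to the already-selected set
    scores = {f: cv_score(X_dev[:, selected + [f]], y_dev) for f in remaining}
    best = min(scores, key=scores.get)     # feature with the lowest mean CV error
    selected.append(best)
    remaining.remove(best)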

The code, with my attempts filled into the # Your code here sections, is as follows:

# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code

# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)

selected_train_error = []
selected_valid_error = []

# For greedy selection
for sel in range(sel_num) :
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0

    for i in range(X_dev.shape[1]) :
        train_error_ith = []
        valid_error_ith = []

        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature

        # Your code here
        X_dev_fs = X_dev[:, i]
        if (i in selected_feature):
            continue
        else:
            pass
        # End your code


        # For cross validation
        for train_index, test_index in cv.split(X_dev) : # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]

            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()

            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))

            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))

            # End your code

    # Select best performance feature set on each features
    # You should choose the feature which has minimum mean cross validation error

    # Your code here

    min_train_error = train_error_ith[np.argmin(train_error_ith)]
    min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
    min_feature = np.argmin(valid_error_ith)

    # End your code

    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)


The algorithm I had in mind while filling in the # Your code here sections was that X_dev_fs would hold the feature of the current iteration together with the previously selected features. We would then use cross-validation to derive the training and CV errors.
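
For illustration, a minimal sketch of that idea (my own assumption about the missing piece, not the assignment's reference solution) would index the previously selected columns together with the current candidate i:

candidate_cols = selected_feature + [i]    # previously selected columns plus candidate i
X_dev_fs = X_dev[:, candidate_cols]        # shape (900, len(candidate_cols))
# X_dev_fs[train_index] and X_dev_fs[test_index] are 2-D even when only one
# column has been chosen, so the .reshape(-1, 1) calls above would no longer be needed.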

The current output I get after running this program is

==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================

and so on, with # of selected feature(s) continuing up to 100.

The problem is that Selected feature of this iteration : should not print the same number more than once. I'm also having trouble figuring out how to store the best feature and use it in the subsequent iterations.

The questions I have are:

  1. Why does my selected_feature list contain the same feature repeatedly, and how can I prevent duplicates from appearing?

  2. How do I store the best feature in selected_feature and then use it, paired with each of the remaining features, in later iterations?


Any feedback is appreciated. Thank you.


EDIT

Here are links to the files I'm loading into the variables, in case anyone needs them:

dev_sample.npy

dev_label.npy

test_sample.npy

test_label.npy

0 Answers