Linear regression with a greedy feature selection algorithm in Python

Asked: 2018-11-04 09:08:51

Tags: python machine-learning linear-regression feature-selection

This is a homework problem from a machine-learning course I'm taking. I'll describe the approach I took, what worked, and what didn't, as best I can.


We are given four data files: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We start by loading the datasets as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("./test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)

The problem we need to solve is to implement a greedy feature selection algorithm that keeps going until the best 100 of the 126 features have been chosen. Basically, we train a model on each single feature and store the best one, then train 125 models pairing each remaining feature with the selected one and store the next best one, and continue this way until we reach 100 features.
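
To make the intended procedure concrete, here is a rough sketch of the greedy forward-selection loop on its own (this is only my restatement of the idea, not the homework template; cv_score is a hypothetical helper that would return the mean cross-validation MSE of a linear regression fitted on the given columns):

selected = []
remaining = list(range(X_dev.shape[1]))   # all 126 candidate feature indices
for _ in range(100):
    # evaluate adding each remaining feature to the already-selected set
    scores = {f: cv_score(X_dev[:, selected + [f]], y_dev) for f in remaining}
    best = min(scores, key=scores.get)     # feature with the lowest mean CV error
    selected.append(best)
    remaining.remove(best)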

The code, with my attempts filled into the # Your code here sections, is as follows:

# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code

# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)

selected_train_error = []
selected_valid_error = []

# For greedy selection
for sel in range(sel_num) :
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0

    for i in range(X_dev.shape[1]) :
        train_error_ith = []
        valid_error_ith = []

        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature

        # Your code here
        X_dev_fs = X_dev[:, i]
        if (i in selected_feature):
            continue
        else:
            pass
        # End your code


        # For cross validation
        for train_index, test_index in cv.split(X_dev) : # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]

            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()

            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))

            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))

            # End your code

    # Select best performance feature set on each features
    # You should choose the feature which has minimum mean cross validation error

    # Your code here

    min_train_error = train_error_ith[np.argmin(train_error_ith)]
    min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
    min_feature = np.argmin(valid_error_ith)

    # End your code

    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)


The algorithm I had in mind while filling in the # Your code here sections was that X_dev_fs would hold the feature of the current iteration together with the previously selected features. We would then use cross-validation to derive the training and CV errors.
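
For illustration, a minimal sketch of that idea (my own assumption about the missing piece, not the assignment's reference solution) would index the previously selected columns together with the current candidate i:

candidate_cols = selected_feature + [i]    # previously selected columns plus candidate i
X_dev_fs = X_dev[:, candidate_cols]        # shape (900, len(candidate_cols))
# X_dev_fs[train_index] and X_dev_fs[test_index] are 2-D even when only one
# column has been chosen, so the .reshape(-1, 1) calls above would no longer be needed.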

The current output I get after running this program is

==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================

and so on, with # of selected feature(s) continuing up to 100.

The problem is that Selected feature of this iteration : should not print the same number more than once. I'm also having trouble figuring out how to store the best feature and use it in the subsequent iterations.

The questions I have are:

  1. Why does my selected_feature list contain the same feature repeatedly, and how can I prevent duplicates from appearing?

  2. How do I store the best feature in selected_feature and then use it, paired with each of the remaining features, in later iterations?


Any feedback is appreciated. Thank you.


EDIT

Here are links to the files I'm loading into the variables, in case anyone needs them:

dev_sample.npy

dev_label.npy

test_sample.npy

test_label.npy

0 Answers