This is a homework problem from a machine learning course I am taking. I will describe the approach I have taken, what has worked, and what has not, as thoroughly as I can.
We are given four data files: dev_sample.npy, dev_label.npy, test_sample.npy, and test_label.npy. We first load the datasets as follows:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
X_dev = np.load("./dev_sample.npy") # shape (900, 126)
y_dev = np.load("./dev_label.npy") # shape (900,)
X_test = np.load("/test_sample.npy") # shape (100, 126)
y_test = np.load("./test_label.npy") # shape (100,)
The problem we need to solve is to implement a "greedy feature selection" algorithm until the best 100 of the 126 features are selected. Essentially, we train 126 one-feature models and store the best feature; then we train 125 models pairing each remaining feature with the stored one and store the second-best feature; and we continue like this until we reach 100 features.
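To make the structure I am aiming for concrete, here is how I would sketch greedy forward selection in the abstract. This is only my own paraphrase of the assignment, not the scaffold itself, and mean_cv_mse is a hypothetical helper standing in for the cross-validation block:

def greedy_forward_selection(X, y, n_select, mean_cv_mse):
    # mean_cv_mse(X_subset, y) is assumed to return the mean
    # cross-validation MSE of a linear model on that feature subset
    selected = []
    for _ in range(n_select):
        best_feature, best_error = None, float("inf")
        for i in range(X.shape[1]):
            if i in selected:            # never re-select a chosen feature
                continue
            candidate = selected + [i]   # previously chosen + current candidate
            error = mean_cv_mse(X[:, candidate], y)
            if error < best_error:       # keep the best candidate of this round
                best_feature, best_error = i, error
        selected.append(best_feature)    # store this round's winner for later rounds
    return selected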
The homework code is as follows:
# Define linear regression function
# You may use sklearn.linear_model.LinearRegression
# Your code here
lin_reg = LinearRegression()
# End your code

# Basic settings. DO NOT MODIFY
selected_feature = []
sel_num = 100
valid_split = 1/5
cv = ShuffleSplit(n_splits=5, test_size=valid_split, random_state=0)
selected_train_error = []
selected_valid_error = []

# For greedy selection
for sel in range(sel_num):
    min_train_error = +1000
    min_valid_error = +1000
    min_feature = 0
    for i in range(X_dev.shape[1]):
        train_error_ith = []
        valid_error_ith = []
        # Select feature greedy
        # Hint : There should be no duplicated feature in selected_feature
        # Your code here
        X_dev_fs = X_dev[:, i]
        if i in selected_feature:
            continue
        else:
            pass
        # End your code

        # For cross validation
        for train_index, test_index in cv.split(X_dev):  # train_index.shape = 720, test_index.shape = 180, 5 iterations
            X_train, X_valid = X_dev_fs[train_index], X_dev_fs[test_index]
            y_train, y_valid = y_dev[train_index], y_dev[test_index]
            # Derive training error, validation error
            # You may use sklearn.metrics.mean_squared_error, model.fit(), model.predict()
            # Your code here
            model_train = lin_reg.fit(X_train.reshape(-1, 1), y_train.reshape(-1, 1))
            predictions_train = model_train.predict(X_valid.reshape(-1, 1))
            train_error_ith.append(mean_squared_error(y_valid, predictions_train))
            model_valid = lin_reg.fit(X_valid.reshape(-1, 1), y_valid.reshape(-1, 1))
            predictions_valid = model_valid.predict(X_valid.reshape(-1, 1))
            valid_error_ith.append(mean_squared_error(y_valid, predictions_valid))
            # End your code

        # Select best performance feature set on each features
        # You should choose the feature which has minimum mean cross validation error
        # Your code here
        min_train_error = train_error_ith[np.argmin(train_error_ith)]
        min_valid_error = valid_error_ith[np.argmin(valid_error_ith)]
        min_feature = np.argmin(valid_error_ith)
        # End your code

    print('='*50)
    print("# of selected feature(s) : {}".format(sel+1))
    print("min_train_error: {}".format(min_train_error))
    print("min_valid_error: {}".format(min_valid_error))
    print("Selected feature of this iteration : {}".format(min_feature))
    selected_feature.append(min_feature)
    selected_train_error.append(min_train_error)
    selected_valid_error.append(min_valid_error)
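For reference, my understanding is that cv.split(X_dev) yields five independent 80/20 shuffles of the 900 development rows, which a quick check confirms:

for train_index, test_index in cv.split(X_dev):
    print(train_index.shape, test_index.shape)   # prints (720,) (180,) five times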
When filling in the # Your code sections, the algorithm I had in mind was that X_dev_fs would hold the feature of the current iteration together with the previously selected features. Then we would use cross-validation to derive the training and validation errors.
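In other words, the selection line I believe I am supposed to write is something along these lines (assuming selected_feature holds the column indices chosen so far):

# Intended: the current candidate column i stacked together with the
# previously selected columns (selected_feature holds column indices)
X_dev_fs = X_dev[:, selected_feature + [i]]

but I cannot work out how to fit this into the scaffold, nor how min_feature should be tracked alongside it.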
The current output I get when I run my code is:
==================================================
# of selected feature(s) : 1
min_train_error: 9.756743239446392
min_valid_error: 9.689856536723353
Selected feature of this iteration : 1
==================================================
# of selected feature(s) : 2
min_train_error: 9.70991346883164
min_valid_error: 9.674875050182653
Selected feature of this iteration : 1
==================================================
and so on, with # of selected feature(s) counting up to 100.
The problem is that Selected feature of this iteration : should never print the same number more than once, yet it does. I am also having a hard time figuring out how to store the best feature and use it in the subsequent iterations.
My questions are:

1. Why does my selected_feature list contain the same feature repeated, and how can I prevent the duplicates?
2. How do I store the best feature in selected_feature and then use it, paired with each of the remaining features, in later iterations?
Any feedback is appreciated. Thank you.
EDIT: Here is a link to the files I am loading into the variables, in case anyone needs them.