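My goal is to build a stepwise subset selector for regression that works on groups of aggregated features held in the numpy arrays a, b and c. I know how to loop over all columns one at a time (posted below); what I don't know is how to handle groups of columns. Here is a representation of my data: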
import numpy as np

a = np.array([[ 1., 1.],
              [ 1., 1.],
              [ 1., 1.]])
b = np.array([[ 88., 42.5, 9. ],
              [ 121.5, 76., 42.5],
              [ 167., 121.5, 88. ]])
c = np.array([[ 88., 42.5, 13. ],
              [ 117.5, 72., 42.5],
              [ 163., 117.5, 88. ]])
total_features = [a, b, c]
result = np.c_[a, b, c]  # stack the three arrays column-wise
n, p = result.shape      # n = 3 rows, p = 8 columns
print(result)
This produces the expected result:
[[ 1. 1. 88. 42.5 9. 88. 42.5 13. ]
[ 1. 1. 121.5 76. 42.5 117.5 72. 42.5]
[ 1. 1. 167. 121.5 88. 163. 117.5 88. ]]
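Because the stacked matrix keeps each array's columns contiguous, the column range belonging to each group can be recovered from the array widths. The following is a minimal sketch of that bookkeeping; the names `blocks` and `group_columns` are illustrative and not part of the original post.

```python
import numpy as np

a = np.ones((3, 2))
b = np.array([[88., 42.5, 9.], [121.5, 76., 42.5], [167., 121.5, 88.]])
c = np.array([[88., 42.5, 13.], [117.5, 72., 42.5], [163., 117.5, 88.]])

# Hypothetical helper structure: group name -> source array, in stacking order.
blocks = {"a": a, "b": b, "c": c}
result = np.c_[a, b, c]

# Each group occupies a contiguous run of columns in `result`.
widths = [block.shape[1] for block in blocks.values()]
ends = np.cumsum(widths)   # one past the last column of each group
starts = ends - widths     # first column of each group

group_columns = {
    name: list(range(int(start), int(end)))
    for name, start, end in zip(blocks, starts, ends)
}
print(group_columns)  # {'a': [0, 1], 'b': [2, 3, 4], 'c': [5, 6, 7]}
```

Indexing `result[:, group_columns['b']]` then returns exactly the columns that came from `b`.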
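Back to the stepwise procedure: here is how I handle column selection when looking at each feature individually. After fitting a model for every available candidate feature, the best one is appended to features_in_model: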
features_in_model = []
excluded = list(set(x_train.columns) - set(features_in_model))
for feature in excluded:
    x_train_new = x_train[features_in_model + [feature]]
    .....
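How do I build the list "excluded" (using the variables above) when the candidates being compared are groups of features? The excluded list should start out containing all of the feature groups, and one group should be removed from it at each iteration.
For reference, this is a forward stepwise procedure: https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
Answer 0 (score: 2)
I would suggest something like this: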
from collections import OrderedDict
import numpy as np

feature_set = np.array([[ 1., 1., 88., 42.5, 9., 88., 42.5, 13. ],
                        [ 1., 1., 121.5, 76., 42.5, 117.5, 72., 42.5],
                        [ 1., 1., 167., 121.5, 88., 163., 117.5, 88. ]])
# Map each feature name to its column index in feature_set.
feature_column_index = OrderedDict()
feature_column_index['feature1'] = 0
feature_column_index['feature2'] = 1
feature_column_index['feature3'] = 2
feature_column_index['feature4'] = 3
feature_column_index['feature5'] = 4
feature_column_index['feature6'] = 5
feature_column_index['feature7'] = 6
feature_column_index['feature8'] = 7
excluded_features = ['feature2', 'feature7']
# Keep the column index of every feature that is not excluded.
include_columns = [index for name, index in feature_column_index.items() if name not in excluded_features]
print(include_columns)
feature_subset = feature_set[:, include_columns]
print(feature_subset)
This yields the desired column subset:
[0, 2, 3, 4, 5, 7]
[[ 1. 88. 42.5 9. 88. 13. ]
[ 1. 121.5 76. 42.5 117.5 42.5]
[ 1. 167. 121.5 88. 163. 88. ]]
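Note that an OrderedDict preserves insertion order. You can create the feature map in any order you like, but unless it is ordered by column index, you will end up changing the order of the columns in the feature subset. I created it in index order so that the column order is maintained in the output.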
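To illustrate the ordering point, here is a small sketch with a mapping created out of column order. The matrix and feature names are made up for the example; sorting the selected indices is one possible way to restore the original column order, not something the answer itself prescribes.

```python
from collections import OrderedDict
import numpy as np

# Made-up 3x4 matrix: the first row is [0., 1., 2., 3.].
feature_set = np.arange(12.0).reshape(3, 4)

# Mapping deliberately inserted out of column-index order.
feature_column_index = OrderedDict(
    [("third", 2), ("first", 0), ("second", 1), ("fourth", 3)]
)
excluded_features = ["second"]

unsorted_columns = [
    index for name, index in feature_column_index.items()
    if name not in excluded_features
]
sorted_columns = sorted(unsorted_columns)

print(unsorted_columns)                     # [2, 0, 3]
print(feature_set[:, unsorted_columns][0])  # columns appear in insertion order
print(sorted_columns)                       # [0, 2, 3]
print(feature_set[:, sorted_columns][0])    # columns appear in original order
```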
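Since the poster wants to treat these as groups of features ('a', 'b', 'c'), this can be done by maintaining a column mapping per feature subset, as a modification of the more general feature mapping above: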
from collections import OrderedDict
import numpy as np

feature_set = np.array([[ 1., 1., 88., 42.5, 9., 88., 42.5, 13. ],
                        [ 1., 1., 121.5, 76., 42.5, 117.5, 72., 42.5],
                        [ 1., 1., 167., 121.5, 88., 163., 117.5, 88. ]])
# Map each group name to the list of columns it occupies in feature_set.
feature_subset_mapping = OrderedDict()
feature_subset_mapping['a'] = [0, 1]
feature_subset_mapping['b'] = [2, 3, 4]
feature_subset_mapping['c'] = [5, 6, 7]
excluded_subsets = ['b']
# Concatenate the column lists of every group that is not excluded.
include_columns = []
for name, columns in feature_subset_mapping.items():
    if name not in excluded_subsets:
        include_columns = include_columns + columns
print(include_columns)
feature_subset = feature_set[:, include_columns]
print(feature_subset)
Which yields:
[0, 1, 5, 6, 7]
[[ 1. 1. 88. 42.5 13. ]
[ 1. 1. 117.5 72. 42.5]
[ 1. 1. 163. 117.5 88. ]]
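The group mapping can then drive the forward stepwise loop from the question: at each iteration every excluded group is tried in turn, and the group giving the best fit moves from the excluded list to the model. The sketch below is one possible way to do this and rests on several assumptions: it scores candidates by the residual sum of squares of a plain least-squares fit, it uses synthetic random data (the 3-row example above is too small to fit meaningfully), and it ranks all groups rather than applying a stopping rule. The function names are hypothetical.

```python
from collections import OrderedDict
import numpy as np

def residual_sum_of_squares(X, y):
    """RSS of an ordinary least-squares fit of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ coef
    return float(residuals @ residuals)

def forward_group_selection(X, y, group_columns):
    """Add one feature group per iteration, always the one with the lowest RSS."""
    included, history = [], []
    excluded = list(group_columns)  # starts with every group name
    while excluded:
        scores = {}
        for group in excluded:
            # Columns of the groups already in the model plus the candidate.
            cols = [c for g in included + [group] for c in group_columns[g]]
            scores[group] = residual_sum_of_squares(X[:, cols], y)
        best = min(scores, key=scores.get)
        included.append(best)
        excluded.remove(best)       # one group leaves `excluded` per iteration
        history.append((best, scores[best]))
    return included, history

# Synthetic data (an assumption for illustration): y depends only on group 'b'.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X[:, [2, 3, 4]] @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=50)

group_columns = OrderedDict([("a", [0, 1]), ("b", [2, 3, 4]), ("c", [5, 6, 7])])
order, history = forward_group_selection(X, y, group_columns)
print(order)
for group, score in history:
    print(group, round(score, 4))
```

With this synthetic target, group 'b' should be selected first, since it alone explains nearly all of the variance. A real procedure would likely replace the RSS score with a criterion that penalizes model size (for example a p-value threshold or an information criterion) and stop once no remaining group improves it.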