Numpy matrix concatenation in a stepwise procedure

Time: 2018-05-12 20:17:52

Tags: python numpy

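My goal is to build a stepwise subset selector for regression based on groups of features aggregated in the numpy arrays a, b and c. I know how to do this when looking at all columns individually (posted below); what I don't know is how to handle groups of columns. Here is a representation of my data: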

import numpy as np

a = np.array([[ 1.,  1.],
              [ 1.,  1.],
              [ 1.,  1.]])

b = np.array([[  88.,    42.5,    9. ],
              [ 121.5,   76.,    42.5],
              [ 167.,  121.5,   88. ]])

c = np.array([[  88.,    42.5,   13. ],
              [ 117.5,   72.,    42.5],
              [ 163.,   117.5,  88. ]])

total_features = [a, b, c]
result = np.empty((3,8), dtype=object)

n, p = result.shape
result = np.c_[a,b,c]

This produces the expected result:

[[  1.    1.    88.   42.5   9.   88.   42.5  13. ]
 [  1.    1.    121.5  76.   42.5 117.5  72.   42.5]
 [  1.    1.    167.  121.5  88.  163.  117.5  88. ]]

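Back to the stepwise procedure: here is how I handle column selection when looking at each feature individually, appending the best one to features_in_model after fitting a model for every available feature: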

features_in_model = [] 
excluded = list(set(x_train.columns)-set(features_in_model))
for feature in excluded:
     x_train_new = x_train[features_in_model+[feature]]
.....

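How do I build the list "excluded" (working with the variables above) when it is groups of features that are being compared? The excluded list should contain all of the included features, with one group removed at each iteration.

For reference, here is a forward stepwise procedure: https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn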

1 answer:

Answer 0 (score: 2)

I would suggest something like this:

import numpy as np
from collections import OrderedDict

feature_set = np.array([[  1.,    1.,    88.,   42.5,   9.,   88.,   42.5,  13. ],
                        [  1.,    1.,    121.5,  76.,  42.5, 117.5, 72.,   42.5],
                        [  1.,    1.,    167.,  121.5,  88.,  163.,  117.5,  88. ]])

feature_column_index = OrderedDict()
feature_column_index['feature1'] = 0
feature_column_index['feature2'] = 1
feature_column_index['feature3'] = 2
feature_column_index['feature4'] = 3
feature_column_index['feature5'] = 4
feature_column_index['feature6'] = 5
feature_column_index['feature7'] = 6
feature_column_index['feature8'] = 7

excluded_features = ['feature2', 'feature7']

# Keep the column index of every feature that is not excluded.
include_columns = [index for name, index in feature_column_index.items() if name not in excluded_features]
print(include_columns)

feature_subset = feature_set[:, include_columns]

print(feature_subset)

This yields the desired column subset:

[0, 2, 3, 4, 5, 7]

[[  1.   88.   42.5   9.   88.   13. ]
 [  1.  121.5  76.   42.5 117.5  42.5]
 [  1.  167.  121.5  88.  163.   88. ]]

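Note that an OrderedDict keeps its entries in insertion order. You can create the feature map in whatever order you like, but unless it is ordered by column index you will end up changing the order of the columns in the feature subset. I created it in index order so that the column order is maintained in the output.

Since the poster wants to treat these as groups of features ('a', 'b', 'c'), this can be done by maintaining a column mapping for those feature subsets, as a modification of the more general feature mapping above: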
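As a small illustration of the ordering point, the sketch below reuses the first row of feature_set and selects the same three columns with two differently ordered index lists. The index lists are illustrative choices, not part of the original answer.

```python
import numpy as np

feature_set = np.array([[1., 1., 88., 42.5, 9., 88., 42.5, 13.]])

# Indices in ascending order keep the original column order.
in_order = feature_set[:, [0, 3, 7]]

# The same indices in a different order reorder the output columns.
reordered = feature_set[:, [7, 0, 3]]

# Sorting the index list first restores the original column order.
restored = feature_set[:, sorted([7, 0, 3])]

print(in_order)
print(reordered)
print(restored)
```

If the mapping cannot be built in index order, sorting the collected indices before slicing is one way to keep the columns where they were.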

import numpy as np
from collections import OrderedDict

feature_set = np.array([[  1.,    1.,    88.,   42.5,   9.,   88.,   42.5,  13. ],
                        [  1.,    1.,    121.5,  76.,  42.5, 117.5, 72.,   42.5],
                        [  1.,    1.,    167.,  121.5,  88.,  163.,  117.5,  88. ]])

feature_subset_mapping = OrderedDict()
feature_subset_mapping['a'] = [0,1]
feature_subset_mapping['b'] = [2,3,4]
feature_subset_mapping['c'] = [5,6,7]

excluded_subsets = ['b']

# Collect the column indices of every subset that is not excluded.
include_columns = []
for name, columns in feature_subset_mapping.items():
    if name not in excluded_subsets:
        include_columns.extend(columns)

print(include_columns)

feature_subset = feature_set[:, include_columns]

print(feature_subset)

Which yields:

[0, 1, 5, 6, 7]

[[  1.    1.   88.   42.5  13. ]
 [  1.    1.  117.5  72.   42.5]
 [  1.    1.  163.  117.5  88. ]]
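To connect this back to the "excluded" list in the question, here is a rough sketch of a forward stepwise loop that works on whole groups rather than single columns. It is only an illustration under several assumptions: the data is synthetic (three rows are too few to compare fits), the target y is made up, the helper residual_sum_of_squares is hypothetical, and the score is a plain least-squares residual sum of squares computed with np.linalg.lstsq rather than whatever criterion the linked procedure uses. The loop also keeps going until every group has been added; a real procedure would stop once the improvement falls below some threshold.

```python
import numpy as np
from collections import OrderedDict

rng = np.random.default_rng(0)
n_rows = 200

# Hypothetical feature groups with the same widths as a, b and c.
groups = OrderedDict()
groups['a'] = rng.normal(size=(n_rows, 2))
groups['b'] = rng.normal(size=(n_rows, 3))
groups['c'] = rng.normal(size=(n_rows, 3))

# Stack the groups and record which columns each group occupies.
feature_set = np.hstack(list(groups.values()))
feature_subset_mapping = OrderedDict()
start = 0
for name, block in groups.items():
    feature_subset_mapping[name] = list(range(start, start + block.shape[1]))
    start += block.shape[1]

# Made-up target: driven mostly by group 'c', slightly by group 'b'.
y = (groups['c'] @ np.array([2.0, -1.0, 3.0])
     + 0.5 * groups['b'][:, 0]
     + 0.1 * rng.normal(size=n_rows))


def residual_sum_of_squares(columns):
    """Fit least squares (with an intercept) on the given columns of feature_set."""
    design = np.column_stack([np.ones(n_rows), feature_set[:, columns]])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefficients
    return float(residuals @ residuals)


groups_in_model = []
selection_order = []  # (group name, RSS after adding that group)
while len(groups_in_model) < len(feature_subset_mapping):
    # Groups not yet in the model: the group-level "excluded" list.
    excluded = [g for g in feature_subset_mapping if g not in groups_in_model]
    scores = {}
    for candidate in excluded:
        # Columns of the groups already chosen plus the candidate group.
        columns = []
        for name in groups_in_model + [candidate]:
            columns.extend(feature_subset_mapping[name])
        scores[candidate] = residual_sum_of_squares(columns)
    # Keep the candidate group that gives the lowest RSS.
    best = min(scores, key=scores.get)
    groups_in_model.append(best)
    selection_order.append((best, scores[best]))

print(selection_order)
```

The list excluded here holds group names rather than column names, so each iteration adds or drops all columns of a group together; the column indices are only looked up from feature_subset_mapping when the design matrix is built.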