Random Forest feature importance in Python

Time: 2021-06-12 23:03:04

Tags: python scikit-learn data-science random-forest numpy-ndarray

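After running hyperparameter tuning and getting the best parameters for my classifier, I am trying to get the feature importances from my data. I have also fitted a model with the best parameters on the training set and am now trying to extract the important features, but I keep getting an error, even after trying every solution I could find online.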

See my code below:

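# Imports needed by this snippet (assumed; the original post omitted them)
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold

# HalvingGridSearchCV is experimental and must be enabled before it is imported
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
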
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_features': ['auto','sqrt'],
    'n_estimators': [100,1000]
}


# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = HalvingGridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
cv = StratifiedKFold(n_splits=10, shuffle = True, random_state=42)



steps_3 = [('over', RandomOverSampler()),  ('chi_square', SelectKBest(chi2, k=7000)), ('estimator', grid_search)]
pipeline_3 = Pipeline(steps=steps_3)
#fit the model
rf_hyperparameter = pipeline_3.fit(X_train, y_train)
print(rf_hyperparameter)

# print('Best parameter set: %s' % grid_search.best_params_)
print("Best Score:" + str(grid_search.best_score_))
print("Best Parameters: " + str(grid_search.best_params_))
best_parameters = grid_search.best_params_

#fit the best parameters to the training data
rf_best = RandomForestClassifier(bootstrap = True, max_features= 'auto', n_estimators = 1000)
rf_best.fit(X_train, y_train)

feature_importances = pd.DataFrame(rf_best.feature_importances_, 
                                   index=X_train.columns,columns=['importance']).sort_values('importance',ascending = False)
feature_importances

After running the code above, this is the error I get:


AttributeError                            Traceback (most recent call last)
<ipython-input-159-563c7c3e7fc5> in <module>
      1 feature_importances = pd.DataFrame(rf_best.feature_importances_, 
----> 2                                    index=X_train.columns,columns=['importance']).sort_values('importance',ascending = False)
      3 feature_importances

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

I would really appreciate any input. Thanks!

1 Answer:

Answer 0 (score: 0)

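The question is missing the part of the code where train_test_split is done. The traceback shows that X_train is a numpy array rather than a pandas DataFrame, so X_train.columns fails. Note that train_test_split returns the same container type it is given, so the data was most likely already converted to an array before the split (for example with df.values). Taking df.columns from the original pandas DataFrame as a list and passing it as index should work.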
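A minimal sketch of that fix, under stated assumptions: the toy DataFrame `df`, its column names `f0` to `f3`, and the synthetic target `y` are made up for illustration, since the asker's data and split code are not shown. The idea is only to save the column names before the data becomes a plain array and to use that list as the index.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the asker's feature table
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((200, 4)), columns=["f0", "f1", "f2", "f3"])
y = (df["f0"] + 0.1 * rng.random(200) > 0.5).astype(int)

# Save the column names before the data is turned into a plain array
feature_names = list(df.columns)

# Arrays in, arrays out: X_train is an ndarray and has no .columns
X_train, X_test, y_train, y_test = train_test_split(
    df.to_numpy(), y.to_numpy(), test_size=0.25, random_state=42
)

rf_best = RandomForestClassifier(n_estimators=100, random_state=42)
rf_best.fit(X_train, y_train)

# Use the saved names as the index instead of X_train.columns
feature_importances = pd.DataFrame(
    rf_best.feature_importances_,
    index=feature_names,
    columns=["importance"],
).sort_values("importance", ascending=False)
print(feature_importances)
```

The list of names has to match the columns the forest was actually fitted on. In the question rf_best is fitted on X_train directly, so all original columns apply; if it were instead fitted on the output of SelectKBest, the names would first need to be filtered with the selector's get_support() mask.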