Feature importance determination and correlation

Time: 2018-11-22 18:01:50

Tags: python heatmap correlation feature-selection

I would like to know which of the variables in my DataFrame df_train have the greatest impact on SalePrice.

   Id  MSSubClass MSZoning    ...     SaleType  SaleCondition SalePrice
0   1          60       RL    ...           WD         Normal    208500
1   2          20       RL    ...           WD         Normal    181500
2   3          60       RL    ...           WD         Normal    223500
3   4          70       RL    ...           WD        Abnorml    140000
4   5          60       RL    ...           WD         Normal    250000

To find out, I ran an analysis with correlation and with sklearn's feature_importances_. The code for the correlation and its heatmap visualization is:

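import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations; numeric_only=True is needed on pandas >= 2.0 when text columns are present
corrmat = df_train.corr(numeric_only=True)
k = 20  # number of variables for heatmap
# The k columns most correlated with SalePrice (SalePrice itself comes first)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()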

And for the feature importance determination:

feature_labels = np.array(['OverallQual', 'GrLivArea', 'SimplOverallQual', 'ExterQual', 'GarageCars', 'KitchenQual', 'SimplExterQual', 'GarageArea', 'SimplKitchenQual', 'TotalBsmtSF', 'FullBath', 'YearBuilt', '1stFlrSF', 'YearRemodAdd', 'TotRmsAbvGrd', 'Fireplaces', 'HeatingQC', 'LotArea', 'MasVnrArea'])
importance = model.feature_importances_
feature_indexes_by_importance = importance.argsort()

indices = np.argsort(importance)[::-1]
for index in feature_indexes_by_importance:
    print('{}-{:.2f}%'.format(feature_labels[index], (importance[index] *100.0)))
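
One thing worth checking in the snippet above: feature_importances_ follows the column order of the data the model was fitted on, so a hand-typed feature_labels array only lines up if it matches that order exactly. Below is a minimal sketch of pairing importances with names taken from the training columns. The question does not show how model was built, so the RandomForestRegressor, the three-column DataFrame, and the synthetic target are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data; the real df_train is not reproduced here.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "GrLivArea": rng.normal(1500, 400, 300),
    "LotArea": rng.normal(10000, 3000, 300),
    "GarageArea": rng.normal(450, 150, 300),
})
# Synthetic target, invented for this sketch only.
y = 100 * X["GrLivArea"] + 2 * X["LotArea"] + rng.normal(0, 5000, 300)

# Assumed model type; the question only refers to `model`.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Take the labels from the columns the model was fitted on,
# so each importance is paired with the right name.
ranked = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

for name, value in ranked.items():
    print('{}-{:.2f}%'.format(name, value * 100.0))
```

If the printed ranking changes once the labels come from X.columns, the original mismatch was at least partly a labeling issue rather than a property of the model.
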
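
According to the heatmap, 'OverallQual', 'GrLivArea', 'SimplQual' are the variables most correlated with SalePrice. According to feature_importances_, the most important are:

GarageArea-9.71% GrLivArea-15.43% LotArea-17.46%

(image: feature importance)

What could explain why the correlation does not agree with sklearn's feature_importances_? Thanks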

1 answer:

Answer 0 (score: 1)

I guess you are talking about the feature_importances_ of trees? (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html)

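Correlation measures the linear dependence between a feature and the output. A random forest fits a nonlinear model that does not rely on linear correlation, so it can pick out the features that matter most to the task in a nonlinear way.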
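
To make the difference concrete, here is a small sketch on synthetic data, not on the housing data from the question. It assumes a RandomForestRegressor, and the two feature names are made up: one feature has a weak linear effect on the target, the other a strong but symmetric (quadratic) effect that Pearson correlation barely registers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 2000

# Hypothetical features, invented for this illustration.
X = pd.DataFrame({
    "linear": rng.uniform(-1, 1, n),     # weak linear effect on y
    "quadratic": rng.uniform(-1, 1, n),  # strong, symmetric nonlinear effect on y
})
y = 0.5 * X["linear"] + 3.0 * X["quadratic"] ** 2 + rng.normal(0, 0.1, n)

# Absolute Pearson correlation of each feature with the target.
pearson = X.corrwith(y).abs()

# Impurity-based importances from a random forest (assumed model type).
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)

print(pd.DataFrame({"abs_pearson": pearson, "rf_importance": importance}))
```

In this setup the correlation ranks "linear" first while the forest ranks "quadratic" first, which is the same kind of disagreement described in the question. Tree ensembles can also split credit among strongly correlated columns, which may lower each one's individual score; sklearn.inspection.permutation_importance is one alternative worth comparing against.
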