Feature importance with LightGBM

Time: 2018-11-21 13:58:08

Tags: python python-3.x lightgbm

I am trying to run LightGBM for feature selection;

Initialization

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', 
         boosting_type = 'goss', 
         n_estimators = 10000, class_weight ='balanced')

Then I fit the model as shown below

# Fit the model twice to avoid overfitting
for i in range(2):

   # Split into training and validation set
   train_features, valid_features, train_y, valid_y = train_test_split(train_X, train_Y, test_size = 0.25, random_state = i)

   # Train using early stopping
   model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], 
             eval_metric = 'auc', verbose = 200)

   # Record the feature importances
   feature_importances += model.feature_importances_

But I get the following error

Training until validation scores don't improve for 100 rounds. 
Early stopping, best iteration is: [6]  valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,) 

4 Answers:

Answer 0 (score: 3)

Depending on whether we trained LightGBM through the scikit-learn API or the native API, to get the importances we should use the feature_importances_ property or the feature_importance() function, respectively, as in this example (where model is the result of lgbm.fit() / lgbm.train() and train_columns = x_train_df.columns):

import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function 
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted, 
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']

    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)

    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]   

    return cv_varimp_df
    

Note that we rely on the assumption that the feature importance values are ordered just like the model matrix columns were ordered during training (including one-hot dummy columns), see LightGBM #209.
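
A brief usage sketch of the helper above, under the same assumptions (x_train_df is the pandas feature DataFrame from the answer; y_train is a hypothetical target vector):

import lightgbm as lgbm

# Fit a classifier via the scikit-learn API (placeholder data: x_train_df, y_train)
clf = lgbm.LGBMClassifier(n_estimators=100)
clf.fit(x_train_df, y_train)

# Top 20 features by importance, using get_lgbm_varimp defined above
print(get_lgbm_varimp(clf, x_train_df.columns, max_vars=20))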

Answer 1 (score: 0)

An example of getting feature importance in LightGBM when using a model produced by train().
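
A minimal sketch of that approach (assuming the question's train_X / train_Y as training data):

import lightgbm as lgb
import pandas as pd

# Train a Booster directly with the native API
train_data = lgb.Dataset(train_X, label=train_Y)
booster = lgb.train({'objective': 'binary'}, train_data, num_boost_round=100)

# feature_importance() returns split counts by default; importance_type='gain' gives total gain
importance_df = pd.DataFrame({
    'feature': booster.feature_name(),
    'importance': booster.feature_importance(importance_type='gain'),
}).sort_values('importance', ascending=False)
print(importance_df)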

Answer 2 (score: 0)

LightGBM version 3.1.1, expanding on the comment of @user3067175:

pd.DataFrame({'Value':model.feature_importance(),'Feature':features}).sort_values(by="Value",ascending=False)

where features is a list of feature names, in the same order as your dataset's columns; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.

Note: if you use LGBMRegressor, you should use

pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
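For context, a small sketch under this answer's assumptions (model is a trained native Booster and df_train the training DataFrame; both are placeholders):

import pandas as pd
import lightgbm as lgb

features = df_train.columns.tolist()

importance_df = (
    pd.DataFrame({'Value': model.feature_importance(), 'Feature': features})
    .sort_values(by='Value', ascending=False)
)
print(importance_df.head(20))

# The built-in importance plot should show the same ranking
lgb.plot_importance(model, max_num_features=20)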

Answer 3 (score: 0)

If you want to inspect a loaded model without the training data, you can get the feature importances and feature names as follows

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
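
For completeness, a minimal sketch of loading such a model from a saved file first (the file name model.txt is a placeholder):

import lightgbm as lgb
import pandas as pd

# Load a Booster previously saved with booster.save_model('model.txt')
model = lgb.Booster(model_file='model.txt')

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)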