I am trying to run lightgbm for feature selection.

Initialization:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')
Then I fit the model as follows:
# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)
    # Train using early stopping
    model.fit(train_features, train_y,
              early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc', verbose=200)
    # Record the feature importances
    feature_importances += model.feature_importances_
But I get the following error:
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is: [6] valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,)
Answer 0 (score: 3)
To get the importances, we should use either the feature_importance() function or the feature_importances_ attribute, depending on whether lightgbm was trained via the native API or the scikit-learn API, as in this example (where model is the result of lgbm.fit() / lgbm.train() and train_columns = x_train_df.columns):
import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted,
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T
    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]
    return cv_varimp_df
Note that we rely on the assumption that feature importance values are ordered just like the model matrix columns were ordered during training (including one-hot dummy columns); see LightGBM #209.
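As a quick sanity check, the helper above can be exercised without fitting a real model: any object exposing a feature_importances_ attribute takes the scikit-learn branch. DummyModel below is a hypothetical stand-in for illustration only; a fitted LGBMClassifier would work the same way.

```python
import numpy as np
import pandas as pd

# get_lgbm_varimp as defined above
def get_lgbm_varimp(model, train_columns, max_vars=50):
    if "basic.Booster" in str(model.__class__):
        # native Booster: use the feature_importance() function
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # scikit-learn API: use the feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T
    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    return cv_varimp_df.iloc[0:max_vars]

# Hypothetical stand-in for a fitted scikit-learn-API model:
# it only needs to expose feature_importances_.
class DummyModel:
    feature_importances_ = np.array([10, 0, 25])

varimp = get_lgbm_varimp(DummyModel(), ['f1', 'f2', 'f3'], max_vars=2)
print(varimp['feature_name'].tolist())  # → ['f3', 'f1'], most important first
```

The returned frame is sorted descending by importance and truncated to max_vars rows.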
Answer 1 (score: 0)
An example of getting feature importance in lightgbm when using a model trained with train.
Answer 2 (score: 0)
For LightGBM version 3.1.1, expanding on @user3067175's comment:
pd.DataFrame({'Value': model.feature_importance(), 'Feature': features}).sort_values(by="Value", ascending=False)
features is a list of feature names in the same order as your dataset; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.
Note: if you use LGBMRegressor, you should use

pd.DataFrame({'Value': model.feature_importances_, 'Feature': features}).sort_values(by="Value", ascending=False)
Answer 3 (score: 0)
If you want to inspect a loaded model without having the training data, you can get the feature importances and the feature names with:
df_feature_importance = (
pd.DataFrame({
'feature': model.feature_name(),
'importance': model.feature_importance(),
})
.sort_values('importance', ascending=False)
)