This is how I can compute SHAP values from a model trained on a single fold:
clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='auc', verbose=100, early_stopping_rounds=200)
import shap # package used to calculate Shap values
# Create object that can calculate shap values
explainer = shap.TreeExplainer(clf)
# Calculate Shap values
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
As you know, results from different folds may differ. How can I average these shap_values across folds?
Answer 0 (score: 0)
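Because we have this rule:
It is fine to average the SHAP values of models with the same output that were trained on the same input features, just make sure to also average the expected value of each explainer. However, if you have non-overlapping test sets, then you can't average the SHAP values from the test sets, since they are for different samples. You could explain the whole dataset with each of your models and then average the results into a single matrix. (It is fine to explain examples in your training set, just remember that you may be overfit to them.)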
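The rule above can be sketched with plain NumPy. This is a minimal illustration, not part of the original answer: the helper name `average_explanations` and the toy numbers are assumptions, and the arrays stand in for what each model's explainer would return on the same rows. It also checks why the expected values must be averaged too: SHAP values are additive (base value plus the row sum equals the model output), so the averaged pair reconstructs the average of the models' outputs.

```python
import numpy as np


def average_explanations(shap_per_model, expected_per_model):
    """Average SHAP matrices and base values from several models.

    shap_per_model: list of (n_samples, n_features) arrays, one per model,
        all computed on the SAME rows and the SAME feature columns.
    expected_per_model: list of scalar base values, one per explainer.
    """
    # np.stack raises ValueError if the matrices do not share one shape.
    stacked = np.stack([np.asarray(s, dtype=float) for s in shap_per_model])
    mean_shap = stacked.mean(axis=0)
    mean_expected = float(np.mean(expected_per_model))
    return mean_shap, mean_expected


# Toy numbers standing in for two models explained on the same 2 rows, 3 features.
shap_a = np.array([[0.2, -0.1, 0.0], [0.4, 0.1, -0.3]])
shap_b = np.array([[0.0, -0.3, 0.2], [0.2, 0.3, -0.1]])
mean_shap, mean_expected = average_explanations([shap_a, shap_b], [0.5, 0.7])

# Additivity check: base value + row sum of SHAP values = model output,
# so the averaged explanation reproduces the averaged output.
out_a = 0.5 + shap_a.sum(axis=1)
out_b = 0.7 + shap_b.sum(axis=1)
averaged_output = (out_a + out_b) / 2
reconstructed = mean_expected + mean_shap.sum(axis=1)
```

Here `reconstructed` matches `averaged_output`, which would not hold if only the SHAP matrices were averaged and a single model's base value were kept.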
So we need a hold-out dataset here to follow that rule. I did something like this to make everything work as expected:
import numpy as np
import lightgbm as lgb
import shap
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hold out a test set; every fold model explains these same rows.
X_train, X_test, y_train, y_test = train_test_split(
    df[feat], df['target'].values,
    test_size=0.2, shuffle=True, stratify=df['target'].values,
    random_state=42)

n_splits = 10
folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

shap_values = None
auc_scores = []
oof_preds = np.zeros(X_train.shape[0])

for n_fold, (train_idx, valid_idx) in enumerate(folds.split(X_train, y_train)):
    # The fold indices are positions within X_train / y_train, not within df,
    # so index the training split (indexing df would leak hold-out rows).
    train_x, train_y = X_train.iloc[train_idx], y_train[train_idx]
    valid_x, valid_y = X_train.iloc[valid_idx], y_train[valid_idx]

    clf = lgb.LGBMClassifier(nthread=4, boosting_type='gbdt', is_unbalance=True, random_state=42,
                             learning_rate=0.05, max_depth=3,
                             reg_lambda=0.1, reg_alpha=0.01, min_child_samples=21, subsample_for_bin=5000,
                             metric='auc', n_estimators=5000)
    # LightGBM >= 4.0 removed verbose/early_stopping_rounds from fit();
    # there, pass callbacks=[lgb.early_stopping(100)] instead.
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', verbose=False, early_stopping_rounds=100)

    explainer = shap.TreeExplainer(clf)
    fold_shap = explainer.shap_values(X_test)
    # Some shap versions return one array per class for binary classifiers;
    # in that case keep the positive class.
    if isinstance(fold_shap, list):
        fold_shap = fold_shap[1]
    shap_values = fold_shap if shap_values is None else shap_values + fold_shap

    oof_preds[valid_idx] = clf.predict_proba(valid_x)[:, 1]
    auc_scores.append(roc_auc_score(valid_y, oof_preds[valid_idx]))

print('AUC: ', np.mean(auc_scores))
shap_values /= n_splits  # average over the folds
shap.summary_plot(shap_values, X_test)
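Since the question notes that results can differ between folds, it may help to look at the spread as well as the mean. The following is a rough sketch, not something from the original answer: `summarize_folds` is a hypothetical helper, and the toy arrays stand in for the per-fold SHAP matrices computed on the same hold-out rows. It ranks features by mean absolute SHAP value per fold and reports the standard deviation of that importance across folds.

```python
import numpy as np


def summarize_folds(fold_shap, feature_names):
    """Rank features by mean |SHAP| and report the spread across folds.

    fold_shap: list of (n_samples, n_features) arrays, one per fold model,
        all computed on the same hold-out rows.
    Returns a list of (feature, mean_importance, std_across_folds),
    sorted from most to least important.
    """
    stacked = np.stack(fold_shap)               # (n_folds, n_samples, n_features)
    importance = np.abs(stacked).mean(axis=1)   # (n_folds, n_features)
    mean_imp = importance.mean(axis=0)
    std_imp = importance.std(axis=0)            # population std over folds
    order = np.argsort(-mean_imp)
    return [(feature_names[i], float(mean_imp[i]), float(std_imp[i]))
            for i in order]


# Toy SHAP matrices for two folds, 2 hold-out rows, 2 features.
fold_1 = np.array([[1.0, -0.2], [-1.0, 0.2]])
fold_2 = np.array([[0.6, 0.4], [-0.6, -0.4]])
summary = summarize_folds([fold_1, fold_2], ["a", "b"])

for name, mean_imp, std_imp in summary:
    print(f"{name}: mean |SHAP| = {mean_imp:.2f}, std across folds = {std_imp:.2f}")
```

Note that this averages the absolute values within each fold first, which is not the same as taking the absolute value of the averaged matrix: the two differ whenever folds disagree on the sign of a feature's contribution. A large standard deviation relative to the mean suggests the averaged plot hides real disagreement between fold models.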