How to get the SHAP values of a model averaged across folds?

时间:2018-11-16 14:10:40

标签: python machine-learning

This is how I can evaluate a model trained on a single fold:

import shap  # package used to calculate SHAP values

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='auc', verbose=100, early_stopping_rounds=200)
# Create object that can calculate SHAP values
explainer = shap.TreeExplainer(clf)
# Calculate SHAP values
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

As you know, results can differ from fold to fold. How can I compute the average of these shap_values?

1 answer:

Answer 0: (score: 0)

Because we have this rule:

  It is fine to average the SHAP values of models with the same output that were trained on the same input features, just make sure to also average the expected value of each explainer. However, if you have non-overlapping test sets, then you can't average the SHAP values over the test sets, since they are for different samples. You could instead explain the whole dataset with each of your models and then average that into a single SHAP matrix. (It is also fine to explain examples from your training set, just remember you may be overfit to them.)
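In other words, two SHAP matrices can be averaged element-wise only when they explain the same samples; the explainer base values are averaged the same way. A minimal sketch with toy numbers (the matrices and base values here are made up for illustration):

```python
import numpy as np

# Two models trained on the same features, both explaining the SAME samples:
shap_model_a = np.array([[0.2, -0.1], [0.4, 0.0]])  # (n_samples, n_features)
shap_model_b = np.array([[0.4,  0.1], [0.2, 0.2]])
expected_a, expected_b = 0.5, 0.7                   # each explainer's base value

avg_shap = (shap_model_a + shap_model_b) / 2        # element-wise average
avg_expected = (expected_a + expected_b) / 2        # averaged base value
```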

So we need some hold-out dataset here in order to follow that rule. I did something like this to make everything work as expected:

import numpy as np
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score

# Hold out one test set so every fold's explainer scores the same samples
X_train, X_test, y_train, y_test = train_test_split(
    df[feat], df['target'].values,
    test_size=0.2, shuffle=True, stratify=df['target'].values,
    random_state=42)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds_idx = [(train_idx, val_idx)
             for train_idx, val_idx in folds.split(X_train, y=y_train)]
auc_scores = []
oof_preds = np.zeros(X_train.shape[0])
shap_values = None

for n_fold, (train_idx, valid_idx) in enumerate(folds_idx):
    # The fold indices are positions within X_train, not within the full df
    train_x, train_y = X_train.iloc[train_idx], y_train[train_idx]
    valid_x, valid_y = X_train.iloc[valid_idx], y_train[valid_idx]
    clf = lgb.LGBMClassifier(
        nthread=4, boosting_type='gbdt', is_unbalance=True, random_state=42,
        learning_rate=0.05, max_depth=3,
        reg_lambda=0.1, reg_alpha=0.01, min_child_samples=21,
        subsample_for_bin=5000, metric='auc', n_estimators=5000)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', verbose=False, early_stopping_rounds=100)
    # Explain the same held-out X_test with each fold's model and accumulate
    explainer = shap.TreeExplainer(clf)
    fold_shap = explainer.shap_values(X_test)
    if isinstance(fold_shap, list):  # binary LightGBM returns [class0, class1]
        fold_shap = fold_shap[1]
    shap_values = fold_shap if shap_values is None else shap_values + fold_shap
    oof_preds[valid_idx] = clf.predict_proba(valid_x)[:, 1]
    auc_scores.append(roc_auc_score(valid_y, oof_preds[valid_idx]))

print('AUC: ', np.mean(auc_scores))
shap_values /= folds.n_splits  # average over the folds
shap.summary_plot(shap_values, X_test)
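Per the rule quoted above, the explainer base values should be averaged as well. A minimal sketch, where the hypothetical lists `fold_shap_values` and `fold_expected_values` stand in for one SHAP matrix and one `explainer.expected_value` saved per fold inside the CV loop:

```python
import numpy as np

# Hypothetical per-fold results collected during cross-validation
# (toy numbers; in practice, append inside the fold loop):
fold_shap_values = [np.array([[0.1, 0.3]]), np.array([[0.3, 0.1]])]
fold_expected_values = [0.4, 0.6]

avg_shap = np.mean(fold_shap_values, axis=0)         # element-wise mean over folds
avg_expected = float(np.mean(fold_expected_values))  # averaged base value
```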