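I have data from the past 6 months, comprising several hundred thousand user queries. A query returns 500 items on average. I am using LightGBM's pairwise learning-to-rank algorithm to rank the items returned by a user query, and for training I group the rows by user query (as described here).
Based on this and this, I understand that SHAP does not take the grouping into account when computing feature importance, which may lead to biased results. I am interested in local explanations: why, for a specific query, a specific item is ranked first.
To get this, I define the explainer only on the data containing the results of the query that includes the item I want to explain. Sample code: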
import shap

# the item for which we want to extract the local Shapley values
item_of_interest = "ABC"
# subset the training data to include only this item
item_to_explain_df = data.loc[data["item"] == item_of_interest]
# find the query that generated this item
# (use data["query"]: data.query resolves to the DataFrame.query method, not the column)
query_of_interest = item_to_explain_df["query"].unique()
# subset the training data to include the results from the query that generated the item of interest
data_for_one_query_df = data.loc[data["query"].isin(query_of_interest)]
# define a tree explainer
# model is a LightGBM learning-to-rank model
# data_for_one_query_df is the background data: the results of the query that includes the item I am trying to explain
explainer = shap.TreeExplainer(model=model, data=data_for_one_query_df)
# get Shapley values
# item_to_explain_df is the item I am trying to explain (passed positionally; shap_values has no `data` keyword)
shap_values = explainer.shap_values(item_to_explain_df)
# plot
shap.force_plot(explainer.expected_value, shap_values, item_to_explain_df)
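The above gives drastically different results compared to when the explainer is defined on the whole training data. My thinking is that, since SHAP does not take the grouping into account, it is more accurate to define the explainer on ONE query and extract the expected value for that query, but I have not found any documentation to support this. Does anyone have experience using SHAP with grouped learning-to-rank models, and is my approach correct?
For clarity, if I were to feed the whole data to the explainer, the flow would be: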
# define a tree explainer
# no background data is passed here, so shap relies on the training statistics stored in the trees
explainer = shap.TreeExplainer(model=model)
# get Shapley values
# item_to_explain_df is the item I am trying to explain
shap_values = explainer.shap_values(item_to_explain_df)
# plot
shap.force_plot(explainer.expected_value, shap_values, item_to_explain_df)
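
To illustrate why the two setups above can disagree, here is a minimal sketch that does not use shap or LightGBM at all. It computes exact Shapley values by brute force for a hypothetical scoring function (`toy_score`), once against a background made of one query's rows and once against a larger background. All names and numbers are made up for illustration. The point it tries to show is only that Shapley values are measured relative to a baseline (the mean score over the background), so changing the background changes both the expected value and the attributions, while their sum still recovers the same score.

```python
from itertools import combinations
from math import factorial


def toy_score(x):
    # Hypothetical stand-in for a ranking model's raw score (not LightGBM).
    return 2.0 * x[0] + x[0] * x[1] - x[2]


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, relative to the rows in `background`."""
    n = len(x)

    def value(subset):
        # Features in `subset` are taken from x; the rest from each background row.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += f(mixed)
        return total / len(background)

    base = value(frozenset())  # mean score over the background
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contrib = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = frozenset(combo)
                contrib += weight * (value(s | {i}) - value(s))
        phi.append(contrib)
    return base, phi


# Hypothetical feature rows: items from one query vs. items from all queries.
one_query_rows = [[1.0, 0.5, 0.2], [0.8, 0.4, 0.1], [1.2, 0.6, 0.3]]
all_query_rows = one_query_rows + [[0.0, 2.0, 1.0], [3.0, -1.0, 0.5], [-1.0, 0.0, 2.0]]
item = [1.0, 0.5, 0.2]

base_one, phi_one = shapley_values(toy_score, item, one_query_rows)
base_all, phi_all = shapley_values(toy_score, item, all_query_rows)

print("score:", toy_score(item))
print("one-query background:", base_one, phi_one)
print("all-data background: ", base_all, phi_all)
```

With the one-query background the attributions answer "why does this item score differently from the average item in its query", whereas with the larger background they answer "why does it score differently from the average item overall". Which question is the right one for a grouped ranking model is exactly what I am unsure about.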