What is the correct way to extract local feature importance with SHAP from a group-wise learning-to-rank algorithm?

Asked: 2019-12-24 15:37:21

Tags: ranking lightgbm shap

I have data from the past 6 months comprising hundreds of thousands of user queries. A query returns about 500 items on average. I am training a LightGBM pairwise learning-to-rank algorithm to order the items returned by a user query, and during training I group the data by user query (as described here).
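For context, the grouped training setup described above might look like the following sketch; the variable names (train_df, X_train, y_train) are illustrative, not from the original question:

import lightgbm as lgb

# one group per user query: `group` holds the number of returned items for
# each query, in the same row order as the training data (this assumes rows
# are sorted so that each query's items are contiguous)
group_sizes = train_df.groupby("query", sort=False).size().to_numpy()

model = lgb.LGBMRanker(objective="lambdarank")
model.fit(X_train, y_train, group=group_sizes)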

Based on this and this, I understand that SHAP does not make grouped predictions when computing feature importance, which could bias the results. I am interested in local explanations: why, for a specific query, a specific item was ranked at the top.

To get this, I define the explainer only on the data containing the results of the query that includes the item I want to explain. Example code:

import shap

# the item for which we want to extract the local Shapley values
item_of_interest = "ABC"
# subset the training data to include only this item
item_to_explain_df = data.loc[data["item"] == item_of_interest]
# find the query that generated this item
# (bracket notation is needed because "query" collides with DataFrame.query)
query_of_interest = item_to_explain_df["query"].unique()
# subset the training data to the results of the query that generated the item of interest
data_for_one_query_df = data.loc[data["query"].isin(query_of_interest)]

# define a tree explainer
# model is the LightGBM learning-to-rank model
# data_for_one_query_df (the results of the query containing the item I am
# trying to explain) serves as the background data, so the expected value
# is computed over this one query
explainer = shap.TreeExplainer(model=model, data=data_for_one_query_df)

# get the Shapley values
# item_to_explain_df is the item I am trying to explain
shap_values = explainer.shap_values(item_to_explain_df)

# visualize the local explanation as a force plot
shap.force_plot(explainer.expected_value, shap_values, item_to_explain_df)
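As a sanity check (my addition, not part of the original question): with background data the explainer computes interventional SHAP values, so for each explained row the base value plus the SHAP values should recover the model's raw score. A minimal sketch, assuming item_to_explain_df contains only the model's feature columns:

import numpy as np

# raw ranking score the model assigns to the item
raw_score = model.predict(item_to_explain_df)

# local accuracy: expected value + sum of SHAP values == raw score
np.testing.assert_allclose(
    explainer.expected_value + shap_values.sum(axis=1),
    raw_score,
    rtol=1e-4,
)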

The above gives very different results than defining the explainer on the entire training data. My reasoning is that, since SHAP does not make grouped predictions, it is more accurate to define the explainer, and extract the expected value, over ONE query, but I have not found any documentation to support this. Does anyone have experience using SHAP with group-wise learning-to-rank algorithms? Is my approach correct?

For clarity, if I fed the entire data set to the explainer, the flow would be:

# define a tree explainer
explainer = shap.TreeExplainer(model=model)

# get the Shapley values
# item_to_explain_df is the item I am trying to explain
shap_values = explainer.shap_values(item_to_explain_df)

# visualize the local explanation as a force plot
shap.force_plot(explainer.expected_value, shap_values, item_to_explain_df)
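One way to see where the difference comes from (again my addition, not from the original question): without background data, TreeExplainer derives its expected value from statistics stored in the trees over the full training set, whereas with data_for_one_query_df it averages the model output over that single query's items, so the two base values, and hence the force plots, start from different reference points. A short sketch reusing the variables above:

# base value over the whole training distribution (tree-path-dependent)
global_explainer = shap.TreeExplainer(model=model)
# base value over the single query's result set (interventional)
query_explainer = shap.TreeExplainer(model=model, data=data_for_one_query_df)

print("global base value:   ", global_explainer.expected_value)
print("per-query base value:", query_explainer.expected_value)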

0 Answers:

No answers yet.