Improving linear regression predictions with SHAP values

Date: 2019-09-04 04:11:05

Tags: pandas linear-regression

Almost all of my work revolves around linear regression, so even a modest improvement in model predictions would help me a great deal.

I read in a recent ICML paper (https://arxiv.org/pdf/1904.02868.pdf) that training a model on high-SHAP-value data improves its accuracy not only on the training set but also on the larger (train + test) dataset.

I tried implementing this with Shapley values (https://github.com/slundberg/shap), but while the model trained on high-SHAP-value rows does improve its predictions on those rows, it fails to improve the score on the larger (train + test) dataset.
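For context, with a single feature and a linear model, the SHAP values that `LinearExplainer` produces (under the independence assumption) reduce to a simple closed form. A minimal sketch on synthetic data (the variable names here are illustrative, not from the post):

```python
import numpy as np

# For a linear model f(x) = b + w*x, the SHAP value of the single
# feature for sample i is w * (x_i - mean(x)): that feature's
# contribution relative to the average prediction.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
w, b = 2.0, 1.0

shap_vals = w * (x - x.mean())
preds = b + w * x

# Each SHAP value plus the expected prediction recovers the output.
assert np.allclose(preds.mean() + shap_vals, preds)
```

So for a one-feature model, |SHAP value| is just a rescaled distance of the feature from its mean, which matters for interpreting the sampling step below.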

R²: linear regression on the full dataset: 0.54
R²: linear regression on the subset (dominated by high-SHAP-value rows): 0.65
R²: the subset-trained regression above, scored on the full dataset: 0.48

I believe I am making a mistake in how I use the SHAP values; it would be great if someone could point out the error.

import pandas as pd
import shap
from sklearn.linear_model import LinearRegression

# Import Boston dataset
x, y = shap.datasets.boston()

# Merge x and y to create a single dataframe data
x["y"] = y
data = x

# Creating linear regression with a single X "LSTAT"; R2 stands at 0.54
X_full = data[["LSTAT"]]
y_full = data[["y"]]
reg_full_data = LinearRegression().fit(X_full, y_full)
print(reg_full_data.score(X_full, y_full))

# Answer 0.5441462975864799

# Explain linear models predictions using SHAP values
explainer = shap.LinearExplainer(reg_full_data, X_full, feature_dependence="independent")
shap_values = explainer.shap_values(X_full)

# Merging shap values with the data frame (flatten the (n, 1) array to a column)
data["shap_values"] = shap_values.flatten()

# Take absolute SHAP values so the sampling weights favour rows with higher |SHAP|
data[["shap_values"]] = data[["shap_values"]].abs()

# Sampling rows with higher weightage on higher shap values
data_shap = data.sample(frac=0.1, weights='shap_values', random_state=1)

# Creating linear regression with sampled rows (dominated by rows with higher shap values) ; R2 increases to 0.65
x_train = data_shap[["LSTAT"]]
y_train = data_shap[["y"]]
reg_shap = LinearRegression().fit(x_train, y_train)
print(reg_shap.score(x_train, y_train))

# Answer 0.6533205248929236

# Using linear regression trained on high shap value data to predict for entire data set ; R2 is lower 0.48 compared to baseline R2 of 0.54
print(reg_shap.score(X_full, y_full))

# Answer 0.48149775302155906
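One way to see why the 0.65 → 0.48 pattern can arise without any SHAP misuse, sketched on synthetic data (all names and numbers here are illustrative, not from the Boston set): weighting by |SHAP| oversamples extreme feature values, which widens the spread of x and inflates the in-sample R², while the OLS fit on the full data is by construction the linear model with the highest R² on that data, so a subset-trained model can never beat the full-data baseline there.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Boston data: one feature, linear signal, noise.
rng = np.random.default_rng(1)
x = rng.normal(size=(500, 1))
y = 3 * x[:, 0] + rng.normal(scale=2.0, size=500)

reg_full = LinearRegression().fit(x, y)
r2_full = reg_full.score(x, y)

# With one feature, |SHAP value| is proportional to |x - mean(x)|,
# so sampling by it favours rows with extreme feature values.
weights = np.abs(x[:, 0] - x[:, 0].mean())
idx = rng.choice(len(x), size=50, replace=False, p=weights / weights.sum())

reg_sub = LinearRegression().fit(x[idx], y[idx])
r2_sub_insample = reg_sub.score(x[idx], y[idx])  # inflated by the wider x spread
r2_sub_full = reg_sub.score(x, y)                # cannot exceed r2_full

print(r2_full, r2_sub_insample, r2_sub_full)
```

Since OLS on the full data minimizes the squared error on that data, any other linear model (including one fitted on a weighted subset) scores at most `r2_full` there, which suggests the drop to 0.48 is expected behaviour rather than an error in applying SHAP.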

0 Answers:

There are no answers yet