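I am trying to implement a boosted Poisson regression model in xgboost, but I am finding that the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue: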
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df
# --- result ---
#               (0, 0)   (0, 1)   (1, 0)   (1, 1)
# 1.000000e-08  11598.0  11598.0  11598.0  11598.0
# 1.000000e-07   1161.0   1161.0   1161.0   1161.0
# 1.000000e-06    118.0    118.0    118.0    118.0
# 1.000000e-05     12.0     12.0     12.0     12.0
# 1.000000e-04      2.0      2.0      3.0      3.0
# 1.000000e-03      1.0      2.0      3.0      4.0
# 1.000000e-02      1.0      2.0      3.0      4.0
# 1.000000e-01      1.0      2.0      3.0      4.0
# 1.000000e+00      1.0      2.0      3.0      4.0
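Notice that at low frequencies the predictions seem to blow up. This may have something to do with the Poisson lambda * weight dropping below 1 (and in fact, increasing the weight above 1000 does shift the "blow-up" to lower frequencies), but I would still expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.
What would cause this to happen? Is there a parameter that would mitigate the effect?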
Answer 0 (score: 4)
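After some digging, I found a solution. I am documenting it here in case someone else runs into the same issue. It turns out that I needed to add an offset term equal to the (natural) log of the mean frequency. If this isn't immediately obvious, it's because the initial prediction starts at a frequency of 0.5, and many boosting iterations are needed just to rescale the predictions to the mean frequency.
See the code below for an update to the toy example. As I suggested in the original question, the predictions now approach the mean frequency (2.5) at the lower scales.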
import numpy as np
import pandas as pd
import xgboost as xgb

def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i*mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    ## adding an offset term equal to the log of the mean frequency
    offset = np.log(np.mean([i*mult for i in [1, 2, 3, 4]]))
    dmat.set_base_margin(np.repeat(offset, 4))
    # train a poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing scaling
    return bst.predict(dmat)/mult

# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
## round to 1 decimal place to show the result approaches 2.5
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 1))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df
# --- result ---
#               (0, 0)  (0, 1)  (1, 0)  (1, 1)
# 1.000000e-08     2.5     2.5     2.5     2.5
# 1.000000e-07     2.5     2.5     2.5     2.5
# 1.000000e-06     2.5     2.5     2.5     2.5
# 1.000000e-05     2.5     2.5     2.5     2.5
# 1.000000e-04     2.4     2.5     2.5     2.6
# 1.000000e-03     1.0     2.0     3.0     4.0
# 1.000000e-02     1.0     2.0     3.0     4.0
# 1.000000e-01     1.0     2.0     3.0     4.0
# 1.000000e+00     1.0     2.0     3.0     4.0
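To put rough numbers on the rescaling argument, the sketch below measures how far the booster has to travel on the log scale from its starting prediction to the mean label, for each multiplier in the toy example. It uses plain NumPy rather than xgboost, takes the starting frequency of 0.5 from the answer above, and the helper name `log_gap` is made up for illustration. It does not model how many boosting rounds a given gap costs, only the size of the gap.

```python
import numpy as np

# Unscaled toy frequencies from the example above.
base_freqs = np.array([1.0, 2.0, 3.0, 4.0])

# Starting predicted frequency, as stated in the answer (an assumption here).
default_start = 0.5

def log_gap(mult, start=default_start):
    """Hypothetical helper: log-scale distance from `start` to the mean label."""
    mean_label = np.mean(base_freqs * mult)
    return abs(np.log(start) - np.log(mean_label))

for exponent in range(-8, 1):
    mult = 10.0 ** exponent
    # Offset used in the answer: natural log of the mean scaled frequency.
    offset = np.log(np.mean(base_freqs * mult))
    gap_default = log_gap(mult)
    gap_offset = log_gap(mult, start=np.exp(offset))
    print(f"mult={mult:.0e}  gap from 0.5: {gap_default:6.2f}  "
          f"gap with offset: {gap_offset:.2f}")
```

Without the offset the gap grows by log(10), about 2.3, for every factor of ten the frequencies shrink, whereas with the offset the trees start at the mean and only have to fit the deviations around it.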