Question

I'm taking part in a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation), which states that my model will be evaluated as follows:

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

I couldn't find this metric in the docs (it is basically RMSE(log(truth), log(prediction))), so I set out to write a custom scorer:

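import numpy as np
from sklearn.metrics import make_scorer

def custom_loss(truth, preds):
    # RMSE between log(truth) and log(preds); inputs must be strictly positive
    truth_logs = np.log(truth)
    preds_logs = np.log(preds)
    numerator = np.sum(np.square(truth_logs - preds_logs))
    # One number for the whole set of (truth, prediction) pairs
    return np.sqrt(numerator / len(truth))

# Wrap the function as a scorer; greater_is_better=False marks it as a loss
custom_scorer = make_scorer(custom_loss, greater_is_better=False)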

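Two questions:

1) Should my custom loss function return a numpy array of scores (one per (truth, prediction) pair), or should it return the total loss over all (truth, prediction) pairs as a single number?

I looked at the docs, but they weren't very helpful about what my custom loss function should return.

2) When I run: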

import xgboost as xgb
from sklearn.model_selection import GridSearchCV, KFold

xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
          "n_estimators": [1000, 2000], "n_jobs": [8], "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
                              n_jobs=8, cv=KFold(n_splits=10, shuffle=True, random_state=42), verbose=2)

grid_search_cv.fit(X, y)

grid_search_cv.best_score_

I get back:

-0.12137097567803554

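This is very surprising. Given that my loss function computes RMSE(log(truth) - log(prediction)), I shouldn't be able to get a negative best_score_.

Any idea why it would be negative?

Thanks!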

Answer 1

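1) You should return a single number as the loss, not an array. GridSearchCV ranks the parameter combinations according to the results of that scorer.

By the way, instead of defining a custom metric you can use mean_squared_log_error, which is close to what you need: it is the mean of squared differences of log(1 + x), so take its square root to get a value comparable to the competition metric.

2) Why does it return a negative number? We can't tell without your actual data and complete code.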
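
To make points 1) and the mean_squared_log_error suggestion concrete, here is a minimal sketch. The function name rmse_of_logs and the sample prices are made up for illustration; the only library calls assumed are numpy and scikit-learn's make_scorer and mean_squared_log_error. It shows a metric that returns one float for the whole set, and how it compares with the square root of scikit-learn's built-in metric.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_log_error


def rmse_of_logs(y_true, y_pred):
    """Return one float: RMSE between log(y_true) and log(y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))


# Made-up sale prices and predictions, for illustration only.
y_true = np.array([100000.0, 200000.0, 350000.0])
y_pred = np.array([110000.0, 190000.0, 300000.0])

custom_value = rmse_of_logs(y_true, y_pred)

# mean_squared_log_error works on log(1 + x) and returns the mean squared
# error, so its square root is the comparable quantity.
sklearn_value = float(np.sqrt(mean_squared_log_error(y_true, y_pred)))

# A scorer built from the scalar-valued function, usable as scoring=...
# in GridSearchCV.
rmse_of_logs_scorer = make_scorer(rmse_of_logs, greater_is_better=False)
```

For values as large as house prices, log(1 + x) and log(x) are nearly identical, so the two numbers should agree to several decimal places.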

Answer 2

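You should be careful with the terminology.

There are 2 levels of optimization here:

1. The loss function that is optimized when the XGBRegressor is fitted to the data.
2. The scoring function that is optimized during the grid search.

I prefer to call the second one a scoring function rather than a loss function, because a loss function usually refers to the term optimized while the model is being fitted. Your custom function only specifies 2. and leaves 1. unchanged. If you want to change the loss function of XGBRegressor, see here. Most regression models offer several criteria to choose from, such as mean_square_error or mean_absolute_error.

Note that passing your own custom loss function is not supported at the moment (see here and here for the reasons).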
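
The two levels described above can be sketched as follows. This is only an illustration under stated assumptions: it swaps in scikit-learn's SGDRegressor (not XGBRegressor) so that it runs without xgboost, uses synthetic data, and picks the built-in "huber" loss and "neg_mean_absolute_error" scoring arbitrarily. The point is just where each setting lives.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Level 1: the loss minimized while fitting the model is set on the estimator.
model = SGDRegressor(loss="huber", random_state=0)

# Level 2: the score used to compare hyperparameter candidates is set on the
# search object. It does not change what the estimator minimizes during fit.
search = GridSearchCV(
    model,
    {"alpha": [1e-4, 1e-2]},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

fit_loss = search.best_estimator_.loss  # still the estimator's own loss
best_score = search.best_score_         # mean cross-validated score
```

Changing scoring alters which candidate is selected, while the selected model is still trained with the estimator's own loss.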

Answer 3

If greater_is_better is False, make_scorer flips the sign of the score, so the -0.1214 you see corresponds to a loss of 0.1214.
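
A small check of the sign flip, as a sketch: the rmsle function and the toy data are made up for illustration, and DummyRegressor simply stands in for any fitted model.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def rmsle(y_true, y_pred):
    """RMSE between log(y_true) and log(y_pred), as a single float."""
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))


# Toy data; the model always predicts the mean of y.
X = np.zeros((4, 1))
y = np.array([100.0, 200.0, 300.0, 400.0])
model = DummyRegressor(strategy="mean").fit(X, y)

# Calling the metric directly gives a non-negative loss.
raw = rmsle(y, model.predict(X))

# The scorer negates it so that "higher is better" holds for every scorer.
scorer = make_scorer(rmsle, greater_is_better=False)
scored = scorer(model, X, y)
```

Here scored equals -raw, so negating best_score_ from a grid search recovers the loss in its original units.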

Scikit-Learn: custom loss function for GridSearchCV
