Question

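In sklearn's random forest fit function (and most fit() functions), you can pass a "sample_weight" argument to weight individual points. By default all points are weighted equally, and if I pass an array of ones as sample_weight, the result does match the original model fitted without the argument.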

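However, if I pass an array of 0.1s, or of 1/len(array), as sample_weight, the model changes (the predictions are now different), even though the points are still equally weighted. This is troubling, because it suggests that the scale of the weights matters. What, then, is the correct way to scale the weights so that the solution is unique?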

Example below:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
boston = datasets.load_boston()

X = boston.data
y = boston.target

# Baseline: no sample_weight passed
reg = RandomForestRegressor(random_state=1, n_estimators=10)
reg.fit(X, y)

# Equal weights of 1: matches the baseline
reg_eq = RandomForestRegressor(random_state=1, n_estimators=10)
reg_eq.fit(X, y, sample_weight=np.full(len(y), 1))

# Equal weights of 0.1: predictions differ from the baseline
reg_eq_bad = RandomForestRegressor(random_state=1, n_estimators=10)
reg_eq_bad.fit(X, y, sample_weight=np.full(len(y), 0.1))

xt = X[:20]
print(reg.predict(xt))
print(reg_eq.predict(xt))
print(reg_eq_bad.predict(xt))

np.testing.assert_array_almost_equal(reg.predict(xt), reg_eq.predict(xt))
np.testing.assert_array_almost_equal(reg.predict(xt), reg_eq_bad.predict(xt))  # 75% mismatch

Answer 1

If you replace RandomForestRegressor with a plain DecisionTreeRegressor, you will find that the predictions are indeed equal.
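The single-tree claim can be checked with a minimal sketch. It makes two assumptions that are not in the question: load_diabetes stands in for the Boston housing data, and fit_predict is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

# Assumption: load_diabetes is a stand-in dataset, not the data from the question.
X, y = load_diabetes(return_X_y=True)


def fit_predict(sample_weight=None):
    """Fit a fresh tree with the given weights and predict the first 20 training rows."""
    tree = DecisionTreeRegressor(random_state=1)
    tree.fit(X, y, sample_weight=sample_weight)
    return tree.predict(X[:20])


pred_default = fit_predict()                     # no sample_weight
pred_ones = fit_predict(np.full(len(y), 1.0))    # equal weights of 1
pred_tenths = fit_predict(np.full(len(y), 0.1))  # equal weights of 0.1

# Raises AssertionError if the uniformly rescaled weights change the predictions.
np.testing.assert_allclose(pred_default, pred_ones)
np.testing.assert_allclose(pred_default, pred_tenths)
print("max |difference|:", np.abs(pred_default - pred_tenths).max())
```

An unpruned tree reproduces its training targets whenever the feature rows are distinct, so agreement on training rows is a fairly weak check. Predicting held-out rows would be a stricter one.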

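For a random forest, however, once you alter the input data through the sample_weight parameter, there is no guarantee that the predictions stay the same, because of the randomness these models introduce.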

Even so, if the model behaves as expected, the difference should not be large. ...
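One way to see how large the difference actually is would be to measure it directly. The sketch below is only an illustration under stated assumptions: load_diabetes stands in for the Boston housing data, and forest_predictions is a hypothetical helper. No particular output is claimed, since the numbers depend on the dataset and the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Assumption: load_diabetes is a stand-in dataset, not the data from the question.
X, y = load_diabetes(return_X_y=True)


def forest_predictions(weight_value):
    """Fit a forest with a constant sample weight and predict the first 20 rows."""
    reg = RandomForestRegressor(random_state=1, n_estimators=10)
    reg.fit(X, y, sample_weight=np.full(len(y), weight_value))
    return reg.predict(X[:20])


pred_ones = forest_predictions(1.0)
pred_tenths = forest_predictions(0.1)

# Compare the two sets of predictions in absolute and relative terms.
abs_diff = np.abs(pred_ones - pred_tenths)
rel_diff = abs_diff / np.abs(pred_ones)
print("max absolute difference:", abs_diff.max())
print("max relative difference:", rel_diff.max())
```

Reporting the relative difference alongside the absolute one makes it easier to judge whether a mismatch matters at the scale of the target.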

Random forest sample_weight in sklearn's fit()
