I am fitting a random forest regressor on the Titanic dataset.
Here is the code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
reg_rf = RandomForestRegressor(random_state=1)  # by default, 10 trees are used (scikit-learn < 0.22)
reg_rf.fit(X_train, y_train)
# For a regressor, score() returns R^2, not classification accuracy
rfc_train_score = reg_rf.score(X_train, y_train)
rfc_test_score = reg_rf.score(X_test, y_test)
print('train accuracy =', rfc_train_score)
print('test accuracy =', rfc_test_score)
I get the following output:
train accuracy = 0.988660049497
test accuracy = 0.942596699112
But when I run cross-validation on this model:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(reg_rf, X, y, scoring='r2', cv=5)
print(scores)
it gives me:
[ 0.57775117 0.88199732 0.69066105 0.90320741 0.87953982]
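As you can see, the scores differ considerably from fold to fold, and most are well below the single train/test split score. How can I explain this behavior?
I am running Python 3.6.x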
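
One thing that may be worth checking: train_test_split shuffles the rows by default, whereas cross_val_score with cv=5 on a regressor splits with KFold and does not shuffle. If the rows of X are ordered in some way, the folds can differ a lot from each other. Below is a minimal sketch of that effect. It uses synthetic data from make_regression as a stand-in (an assumption, not the Titanic features above), and the names X_demo, y_demo and cv_shuffled are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the Titanic data from the post).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Sort the rows by target to mimic a dataset whose row order carries structure.
order = np.argsort(y_demo)
X_sorted, y_sorted = X_demo[order], y_demo[order]

model = RandomForestRegressor(n_estimators=50, random_state=1)

# cv=5 on a regressor is KFold(n_splits=5) without shuffling:
# each fold is a contiguous block of rows.
scores_ordered = cross_val_score(model, X_sorted, y_sorted, scoring='r2', cv=5)

# Same model and data, but rows are shuffled before being assigned to folds.
cv_shuffled = KFold(n_splits=5, shuffle=True, random_state=123)
scores_shuffled = cross_val_score(model, X_sorted, y_sorted, scoring='r2', cv=cv_shuffled)

print('ordered folds :', scores_ordered)
print('shuffled folds:', scores_shuffled)
```

If passing a shuffled KFold as cv on the real X and y brings the fold scores closer together, row ordering is a likely contributor. If it does not, the spread may simply reflect a small dataset and a model that varies from fold to fold.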