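I'm trying to make predictions with a RandomForest, and I'm using k-fold cross-validation to find the min_samples_leaf value that minimizes the cross-validation error. I can't get the code working because I keep hitting an error at the line train_x = x[train_index]. The code and the error I get are shown below.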
from sklearn import ensemble, model_selection

kf = model_selection.KFold(n_splits=5)
x = train
y = test
for m in range(0, 10):  # vary min_samples_leaf
    dtr = ensemble.RandomForestRegressor(n_estimators=15, min_samples_leaf=m, max_features=10, criterion='mse')
    for train_index, test_index in kf.split(x):
        print("TRAIN:", train_index, "TEST:", test_index)
        train_x = x[train_index]  # this line raises the KeyError
        train_y = y[test_index]
        regr = dtr.fit(train_x, train_y)

KeyError:
None of [Int64Index([15546, 15547, 15548, 15549, 15550, 15551, 15552, 15553, 15554,\n 15555,\n ...\n 77718, 77719, 77720, 77721, 77722, 77723, 77724, 77725, 77726,\n 77727],\n dtype='int64', length=62182)] are in the [columns]
Answer 0 (score: 0)
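kf.split() gives you arrays of integer indices, and the values in train_index that you pass to x[train_index] are not being found in x.
The loop itself looks reasonable, so I suspect the problem is the format of the data in "train" (and therefore in "x").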
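To make the suspected cause concrete, here is a minimal sketch. It assumes that x is a pandas DataFrame, which the question does not state outright; the small frame below is a made-up stand-in for the asker's "train" data. It shows that the indices from kf.split() are row positions, and that plain x[...] on a DataFrame looks them up as column labels instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for the asker's `train` DataFrame (assumption).
x = pd.DataFrame({"a": np.arange(10.0), "b": np.arange(10.0) * 2})

kf = KFold(n_splits=5)
train_index, test_index = next(kf.split(x))

# KFold.split yields integer row positions, not column labels.
# On a DataFrame, x[train_index] treats them as column labels,
# and since no column is named 2, 3, 4, ... it raises KeyError.
try:
    x[train_index]
    raised = False
except KeyError:
    raised = True

# .iloc selects rows by position, which is what the split indices mean.
train_x = x.iloc[train_index]
```

If x were a NumPy array instead, x[train_index] would select rows directly and no KeyError would occur.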
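The error shows an Int64Index (a pandas index type, IIRC) of length 62182 whose values are being looked up in the [columns] of x and not found there. So the problem is almost certainly in how the original data in x is stored and indexed, for example as a pandas DataFrame where x[...] selects columns rather than rows, and not in the cross-validation loop itself.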
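As a rough sketch of how the loop could look once rows are selected by position, the example below uses synthetic data in place of the asker's "train" data. Several things here are assumptions and not taken from the question: that x is a DataFrame, that y holds the target values aligned row by row with x, and that mean squared error is the cross-validation error of interest. It starts min_samples_leaf at 1 because scikit-learn rejects 0, and it leaves criterion and max_features at their defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic features and target standing in for the real data (assumption).
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y = pd.Series(x["a"] * 2.0 + rng.normal(scale=0.1, size=200))

kf = KFold(n_splits=5)
cv_error = {}
for m in range(1, 10):  # min_samples_leaf must be at least 1
    fold_errors = []
    for train_index, test_index in kf.split(x):
        # Select rows by position for both the features and the target.
        train_x, test_x = x.iloc[train_index], x.iloc[test_index]
        train_y, test_y = y.iloc[train_index], y.iloc[test_index]
        model = RandomForestRegressor(
            n_estimators=15, min_samples_leaf=m, random_state=0
        )
        model.fit(train_x, train_y)
        fold_errors.append(mean_squared_error(test_y, model.predict(test_x)))
    # Average the held-out error over the five folds for this setting.
    cv_error[m] = float(np.mean(fold_errors))

# The setting with the lowest mean cross-validation error.
best_m = min(cv_error, key=cv_error.get)
```

Note that both the features and the target are indexed with train_index for fitting, and with test_index only for scoring.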