I have a dataset like this:
[ 5. , 2. , 15. , 0.25535303],
[ 5. , 3. , 15. , 6.72465845],
[ 5. , 4. , 15. , 5.62719504],
[ 5. , 5. , 15. , 5.61760597],
[ 5. , 6. , 15. , 4.9561533 ],
[ 6. , 2. , 15. , 0.2709665 ],
[ 6. , 3. , 15. , 6.07004364],
[ 6. , 4. , 15. , 5.62719504],
[ 6. , 5. , 15. , 5.54684885],
[ 6. , 6. , 15. , 5.32846201],
[ 2. , 2. , 20. , 3.79257349],
[ 2. , 3. , 20. , 4.00440964],
[ 2. , 4. , 20. , 4.37965706],
[ 2. , 5. , 20. , 3.92216922],
[ 2. , 6. , 20. , 3.41378368],
[ 3. , 2. , 20. , 0.13500398],
[ 3. , 3. , 20. , 4.38384781],
[ 3. , 4. , 20. , 5.17229688],
[ 3. , 5. , 20. , 5.00464056],
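The values in the third column range from 15 to 35. I want to apply cross-validation, but I suspect that K-fold will put only rows with the same third-column value into each of the K folds, which would hurt my model.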
So my workaround is:
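import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.utils import shuffle  # assuming shuffle is sklearn.utils.shuffle

dataset_shuffle = shuffle(dataset)        # shuffle the rows once, up front
X = dataset_shuffle[["A", "B", "C"]]      # a list of names selects several columns
y = dataset_shuffle["D"]
result = cross_validate(estimator, X, y, scoring=scoretypes, cv=5, return_train_score=False)
r2 = result['test_r2'].mean()
mselist = -result['test_neg_mean_squared_error']  # flip the sign to get the MSE
rmse = np.sqrt(mselist).mean()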
Do you think this is an appropriate way to solve my problem? And is my solution equivalent to this one?
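from sklearn.model_selection import KFold, cross_validate

X = dataset[["A", "B", "C"]]              # a list of names selects several columns
y = dataset["D"]
cv = KFold(n_splits=5, shuffle=True)      # shuffle inside the splitter instead
result = cross_validate(estimator, X, y, scoring=scoretypes, cv=cv, return_train_score=False)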
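As far as I know, passing an integer `cv=5` with a regressor makes `cross_validate` fall back to `KFold(n_splits=5)` without shuffling, so the two variants differ only in where the shuffle happens. The sketch below runs both on synthetic data. The estimator (`LinearRegression`), the made-up target D, and the random seeds are assumptions for illustration, not part of the original setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.utils import shuffle

# Synthetic stand-in for the dataset (assumption), ordered by C.
rng = np.random.default_rng(0)
rows = [(a, b, c)
        for c in (15, 20, 25, 30, 35)
        for a in range(2, 7)
        for b in range(2, 7)]
dataset = pd.DataFrame(rows, columns=["A", "B", "C"])
# Made-up linear target with noise, only so the example has something to fit.
dataset["D"] = (0.5 * dataset["A"] + 0.2 * dataset["B"] + 0.1 * dataset["C"]
                + rng.normal(0.0, 0.5, len(dataset)))

scoretypes = ["r2", "neg_mean_squared_error"]
estimator = LinearRegression()
features = ["A", "B", "C"]

# Variant 1: shuffle the rows first, then use plain 5-fold CV.
dataset_shuffle = shuffle(dataset, random_state=0)
res1 = cross_validate(estimator, dataset_shuffle[features], dataset_shuffle["D"],
                      scoring=scoretypes, cv=5)

# Variant 2: leave the rows in place and let KFold do the shuffling.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
res2 = cross_validate(estimator, dataset[features], dataset["D"],
                      scoring=scoretypes, cv=cv)

rmse1 = np.sqrt(-res1["test_neg_mean_squared_error"]).mean()
rmse2 = np.sqrt(-res2["test_neg_mean_squared_error"]).mean()
print("shuffle then cv=5:   ", rmse1)
print("KFold(shuffle=True): ", rmse2)
```

The two runs will generally not split the rows identically, but both build five folds from randomly ordered rows, so the scores should agree up to random variation. The `KFold` variant leaves the original row order untouched and makes the seed explicit through `random_state`.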