Question

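Suppose I have 10 separate datasets and I want to build a predictive model. I need to evaluate the model, so I use cross-validation. How can I use each dataset as a fold, or as a specific part, of the CV? For example, how can I use the first 9 datasets as the training set and the 10th as the test set, and then iterate over all the datasets? That way the training and test sets are not chosen at random. Is there a Python function that does this?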

Answer 1

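If your datasets are all the same size and you combine them with pd.concat([df1, df2 ... df10], ignore_index=True), you should be able to get what you want with sklearn's KFold. Shuffling is off by default, so each fold is a contiguous block of rows, and you can set the number of folds with n_splits (10 in your case, so that each fold lines up with one dataset). The default for n_splits was 3 in older scikit-learn releases and is 5 from version 0.22 onward. Here is an example that uses a single 10-row frame and 2 folds: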

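import pandas as pd
from sklearn.model_selection import KFold

# Load a data frame (raw string so the backslash is kept literally)
df = pd.read_csv(r'C:\df.csv')
print(df)

# Cross-validation

# Instantiate KFold; shuffle is off by default, so folds are contiguous row blocks
kf = KFold(n_splits=2)

# Features are every column except the last; the target is the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Show the indices of the train and test sets
print('kf indices:')
for train_index, test_index in kf.split(X):
    print(train_index, test_index)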

    A    B       Name   Surname      Country  Points
0  96  100      Roger   Federer  Switzerland  9600.0
1  80  100     Grigor  Dimitrov     Bulgaria  8000.0
2  72  100    Dominic     Thiem      Austria  7200.0
3  65  100      Pablo     Busta        Spain  6500.0
4  58  100       Stan  Wawrinka  Switzerland  5800.0
5  56  100       Jack      Sock          USA  5600.0
6  44  100      Marin     Cilic      Croatia  4400.0
7  43  100      David    Goffin      Belgium  4300.0
8  25  100  Alexander    Zverev      Germany  2500.0
9  14  100     Rafael     Nadal        Spain  1400.0

kf indices:
[5 6 7 8 9] [0 1 2 3 4]
[0 1 2 3 4] [5 6 7 8 9]
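
The KFold approach above only lines folds up with datasets when every dataset has the same number of rows. If the sizes differ, one option (not part of the original answer) is scikit-learn's LeaveOneGroupOut, which holds out one labelled group per fold. The sketch below assumes three small made-up frames in place of the ten real datasets; the names frames, sizes, feature and target are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-ins for the separate datasets; sizes differ on purpose.
sizes = [4, 6, 5]
frames = [
    pd.DataFrame({"feature": np.arange(n, dtype=float),
                  "target": np.arange(n) % 2})
    for n in sizes
]

# Label every row with the index of the dataset it came from.
groups = np.concatenate([np.full(len(f), i) for i, f in enumerate(frames)])
data = pd.concat(frames, ignore_index=True)

X = data[["feature"]]
y = data["target"]

# Each fold holds out exactly one dataset and trains on all the others.
logo = LeaveOneGroupOut()
folds = []
for train_index, test_index in logo.split(X, y, groups=groups):
    folds.append((train_index, test_index))
    held_out = np.unique(groups[test_index])
    print("held-out dataset:", held_out, "test rows:", len(test_index))
```

With the real data, frames would be the list of ten loaded data frames, and the loop body is where a model would be fitted on the training rows and scored on the held-out dataset.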
