Randomly split a pandas DataFrame into groups in one step for x-fold cross-validation

Time: 2018-09-30 05:27:13

Tags: python pandas dataframe machine-learning

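Suppose I have a DataFrame with 500 rows and I want to perform 10-fold cross-validation. That means I need to split the data into 10 groups of 50 rows each. I also want the assignment to be random and done for the whole DataFrame in a single step.

Is there a way to do this with a library such as pandas or numpy?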

1 answer:

Answer 0 (score: 1)

You can use sklearn's KFold:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold 

# create dummy dataframe with 500 rows
features = np.random.randint(1, 100, 500)
labels = np.random.randint(1, 100, 500)
df = pd.DataFrame(data = {"X": features, "Y": labels})

# Define the split: 10 folds, shuffled once with a fixed seed for reproducibility
kf = KFold(n_splits=10, random_state=42, shuffle=True)
print(kf.get_n_splits(df))  # number of splitting iterations in the cross-validator
print(kf)

# kf.split yields positional indices, so select rows with iloc rather than loc;
# this also works when df does not have the default 0..n-1 index
for train_index, test_index in kf.split(df):
    print("TRAIN:", train_index)
    print("TEST:", test_index)
    X_train, X_test = df["X"].iloc[train_index], df["X"].iloc[test_index]
    y_train, y_test = df["Y"].iloc[train_index], df["Y"].iloc[test_index]

Example taken from here
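
Since the question asks about pandas and numpy specifically, here is a minimal sketch of a comparable split without sklearn. It shuffles the row positions once with `numpy.random.Generator.permutation` and cuts them into 10 equal parts with `numpy.array_split`. The seed 42, the dummy data, and the names `fold_positions`, `folds`, `train_df`, and `test_df` are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Dummy DataFrame with 500 rows (illustrative data only)
rng = np.random.default_rng(42)  # seed is an arbitrary choice
df = pd.DataFrame({
    "X": rng.integers(1, 100, 500),
    "Y": rng.integers(1, 100, 500),
})

# Shuffle all row positions once, then cut them into 10 groups of 50
shuffled_positions = rng.permutation(len(df))
fold_positions = np.array_split(shuffled_positions, 10)

# Materialize each group as its own DataFrame
folds = [df.iloc[pos] for pos in fold_positions]

# Use each group once as the test set and the other nine as training data
for i, test_df in enumerate(folds):
    train_df = pd.concat(folds[:i] + folds[i + 1:])
    print(f"fold {i}: train rows = {len(train_df)}, test rows = {len(test_df)}")
```

Splitting the shuffled positions rather than the DataFrame itself keeps the sketch independent of the DataFrame's index labels, and `array_split` also copes with row counts that do not divide evenly by the number of folds (some groups then get one extra row).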