Python - How to use scikit-learn to write a function that splits data into training and validation sets

Time: 2019-11-25 22:31:21

Tags: python pandas scikit-learn

I have a dataset that looks like this:

df = pd.DataFrame({'sysbp':[106,121,121,105,108], 
                'diabp':[70,66,81,69,66], 
                'totchol':[195,209,250,260,237],
                'ANYCHD':[1,1,0,0,0]})

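ANYCHD is the y variable (the target).

I have to write a function that splits the data into a training set and a validation set. I managed to split the data without a function, using the following: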

from sklearn.model_selection import train_test_split
x, y, z, g = train_test_split(fr, fr['ANYCHD'], test_size = .2) 

I have been asked to make the function accept only 2 parameters: df and size.

This is what I tried:

def split_set(df, size):
    """ 
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    train.x, train.y, test.x, test.y = train_test_split(df, df[""], size)
    return train.x, train.y, test.x, test.y 

fr_train, fr_val, y_train, y_val = split_set(fr, fr['ANYCHD'])

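I cannot change the def statement to take parameters other than df and size, but I can change the rest of the code.

Right now I am stuck because my function takes only 2 arguments while train_test_split is called with 3.

3 Answers:

Answer 0: (score: 1)

Is ANYCHD always the y variable? If so, the solution is very simple: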

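from sklearn.model_selection import train_test_split

def split_set(df, size):
    """ 
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    # train_test_split returns X_train, X_val, y_train, y_val, in that order.
    # size is forwarded as the keyword argument test_size instead of a fixed .2.
    x_train, x_val, y_train, y_val = train_test_split(df, df['ANYCHD'], test_size=size)
    return x_train, x_val, y_train, y_val

fr_train, fr_val, y_train, y_val = split_set(fr, .2)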

You simply hard-code the y column ("ANYCHD") inside the call to train_test_split, so the function itself needs only df and size.
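
As a minimal, self-contained sketch of this approach on the sample data from the question: it assumes ANYCHD is always the target. Dropping the target from the feature frame and fixing random_state are additions for illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample data copied from the question
df = pd.DataFrame({'sysbp': [106, 121, 121, 105, 108],
                   'diabp': [70, 66, 81, 69, 66],
                   'totchol': [195, 209, 250, 260, 237],
                   'ANYCHD': [1, 1, 0, 0, 0]})

def split_set(df, size):
    """Split df into training and validation sets; size is the validation fraction."""
    X = df.drop(columns='ANYCHD')  # assumption: every column except the target is a feature
    y = df['ANYCHD']               # target column is hard-coded, as the answer suggests
    # random_state is fixed only so that this sketch is reproducible
    return train_test_split(X, y, test_size=size, random_state=0)

X_train, X_val, y_train, y_val = split_set(df, 0.2)
print(X_train.shape, X_val.shape)
```

Note that test_size has to be passed by keyword: train_test_split treats every positional argument as another array to split, which is why passing size as a third positional argument does not work.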

Answer 1: (score: 0)

IIUC, you are trying to implement your own version of train_test_split. You could do something like this:

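import numpy as np

def split_data(df, size):
    # Note: here size is the fraction of rows used for training (.8 -> 80% train)
    positions = np.arange(len(df))  # row positions, so iloc works for any index
    train_idx = np.random.choice(positions, int(len(df)*size), replace = False)
    test_idx = np.setdiff1d(positions, train_idx)
    # Features are all columns except the last; the target (ANYCHD) is the last column
    train_x, train_y = df.iloc[train_idx, :-1], df.iloc[train_idx, -1]
    test_x, test_y = df.iloc[test_idx, :-1], df.iloc[test_idx, -1]
    return train_x, train_y, test_x, test_y

split_data(df, .8)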

(   sysbp  diabp  totchol
 4    108     66      237
 2    121     81      250
 3    105     69      260
 1    121     66      209, 
 4    0
 2    0
 3    0
 1    1
 Name: ANYCHD, dtype: int64,    sysbp  diabp  totchol
 0    106     70      195, 0    1
 Name: ANYCHD, dtype: int64)
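
A variation on the same idea, sketched with pandas only. Unlike the answer above, size is the validation fraction here, as in the question, and the values come back in the same order as train_test_split. The name split_manual and the fixed seed are hypothetical choices for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({'sysbp': [106, 121, 121, 105, 108],
                   'diabp': [70, 66, 81, 69, 66],
                   'totchol': [195, 209, 250, 260, 237],
                   'ANYCHD': [1, 1, 0, 0, 0]})

def split_manual(df, size):
    """Hand-rolled split; size is the fraction of rows held out for validation."""
    val = df.sample(frac=size, random_state=0)  # fixed seed only for reproducibility
    train = df.drop(val.index)                  # assumes index labels are unique
    # Same order as train_test_split: X_train, X_val, y_train, y_val
    return (train.drop(columns='ANYCHD'), val.drop(columns='ANYCHD'),
            train['ANYCHD'], val['ANYCHD'])

train_x, val_x, train_y, val_y = split_manual(df, 0.2)
print(train_x.shape, val_x.shape)
```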

Answer 2: (score: 0)

from sklearn.model_selection import train_test_split

def split_set(df, size):
    """ 
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    return train_test_split(df[['SYSBP', 'DIABP', 'TOTCHOL', 'DIABETES', 'CURSMOKE', 'BPMEDS']], df['ANYCHD'], test_size = size)


fr_train, fr_val, y_train, y_val = split_set(fr, .2) 
fr_train.shape, fr_val.shape