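I have a dataset that looks like this:
import pandas as pd
df = pd.DataFrame({'sysbp': [106, 121, 121, 105, 108],
                   'diabp': [70, 66, 81, 69, 66],
                   'totchol': [195, 209, 250, 260, 237],
                   'ANYCHD': [1, 1, 0, 0, 0]})
ANYCHD is the y variable.
I have to write a function that splits the data into a training set and a validation set. I managed to split the data without wrapping the call in a function, like this: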
from sklearn.model_selection import train_test_split
x, y, z, g = train_test_split(fr, fr['ANYCHD'], test_size = .2)
I was told that the function may take only 2 arguments: df and size.
I tried:
def split_set(df, size):
    """
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    train.x, train.y, test.x, test.y = train_test_split(df, df[""], size)
    return train.x, train.y, test.x, test.y

fr_train, fr_val, y_train, y_val = split_set(fr, fr['ANYCHD'])
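I can't change the def statement to take parameters other than df and size, but I can change the rest of the code.
Right now I'm stuck because my function takes only 2 arguments while my call to train_test_split uses 3.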
Answer 0: (score: 1)
Is ANYCHD always the y variable? If so, the solution is very simple:
from sklearn.model_selection import train_test_split

def split_set(df, size):
    """
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    # train_test_split returns X_train, X_val, y_train, y_val, in that order
    x_train, x_val, y_train, y_val = train_test_split(df, df['ANYCHD'], test_size=size)
    return x_train, x_val, y_train, y_val

fr_train, fr_val, y_train, y_val = split_set(fr, .2)
You just need to hard-code the y information ('ANYCHD') in the call to train_test_split.
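A variation on the same idea, as a minimal sketch: it assumes the y column is always named 'ANYCHD' and additionally drops that column from the features, so the target does not end up inside the training inputs. The `TARGET` constant is an illustrative name, not something from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "ANYCHD"  # assumption: the y column always has this name

def split_set(df, size):
    """
    df: DataFrame to split
    size: proportion of data to allocate to validation set
    """
    features = df.drop(columns=[TARGET])  # keep the target out of the features
    target = df[TARGET]
    # Returns X_train, X_val, y_train, y_val, in that order
    return train_test_split(features, target, test_size=size)

# Sample data from the question
df = pd.DataFrame({'sysbp': [106, 121, 121, 105, 108],
                   'diabp': [70, 66, 81, 69, 66],
                   'totchol': [195, 209, 250, 260, 237],
                   'ANYCHD': [1, 1, 0, 0, 0]})

x_train, x_val, y_train, y_val = split_set(df, .2)
print(x_train.shape, x_val.shape)
```

With the five-row sample and size = .2, this should give four training rows and one validation row, each with the three feature columns.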
Answer 1: (score: 0)
IIUC, you are trying to implement your own version of train_test_split. You could do something like this:
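import numpy as np

def split_data(df, size):
    # size is the training fraction here; sample row positions so iloc works with any index
    positions = np.arange(len(df))
    train_idx = np.random.choice(positions, int(len(df) * size), replace=False)
    test_idx = np.setdiff1d(positions, train_idx)
    # Assumes the target is the last column
    train_x, train_y = df.iloc[train_idx, :-1], df.iloc[train_idx, -1]
    test_x, test_y = df.iloc[test_idx, :-1], df.iloc[test_idx, -1]
    return train_x, train_y, test_x, test_y
split_data(df, .8)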
(   sysbp  diabp  totchol
 4    108     66      237
 2    121     81      250
 3    105     69      260
 1    121     66      209,
 4    0
 2    0
 3    0
 1    1
 Name: ANYCHD, dtype: int64,
    sysbp  diabp  totchol
 0    106     70      195,
 0    1
 Name: ANYCHD, dtype: int64)
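The same hand-rolled split can also be sketched with pandas alone. This is only an illustration under a few assumptions: `split_data_sampled` is a hypothetical name, size means the validation fraction (as in the question, unlike the call above), the target is the last column, the index labels are unique, and the seed is fixed purely so the example is reproducible.

```python
import pandas as pd

def split_data_sampled(df, size):
    """Hypothetical helper: size is the validation fraction; target is the last column."""
    # Draw the validation rows; the fixed seed is only for reproducibility
    val = df.sample(frac=size, random_state=0)
    # Everything not drawn becomes training data (requires unique index labels)
    train = df.drop(val.index)
    # Same return order as train_test_split: X_train, X_val, y_train, y_val
    return train.iloc[:, :-1], val.iloc[:, :-1], train.iloc[:, -1], val.iloc[:, -1]

# Sample data from the question
df = pd.DataFrame({'sysbp': [106, 121, 121, 105, 108],
                   'diabp': [70, 66, 81, 69, 66],
                   'totchol': [195, 209, 250, 260, 237],
                   'ANYCHD': [1, 1, 0, 0, 0]})

x_train, x_val, y_train, y_val = split_data_sampled(df, .2)
print(len(x_train), len(x_val))
```

Dropping the fixed `random_state` would give a different split on each call, which is closer to what the NumPy version does.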
Answer 2: (score: 0)
from sklearn.model_selection import train_test_split

def split_set(df, size):
    """
    df: DataFrame to split
    size: proportion of data to allocate to validation set (same as train_test_split's test_size)
    """
    return train_test_split(df[['SYSBP', 'DIABP', 'TOTCHOL', 'DIABETES', 'CURSMOKE', 'BPMEDS']], df['ANYCHD'], test_size = size)

fr_train, fr_val, y_train, y_val = split_set(fr, .2)
fr_train.shape, fr_val.shape