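Is there a convenient way to split a dataset into training and test sets while keeping records that belong to the same group together?
Take, for example, a table that records an independent and a dependent variable for each person_id, so that every person has one or more entries: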
import numpy as np
import pandas as pd

tbl = pd.DataFrame(dict(
    person_id=list('aaabbcccdeeefffhiijj'),
    random_variable=np.linspace(0, 1, 20),
    dependent_variable=np.arange(20)
))
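Now I want to split the data into training and test sets, keeping all records of the same person in the same set. Obviously, sklearn.cross_validation.train_test_split does not do this, because it samples individual rows. I know about sklearn.cross_validation.LeavePLabelOut, but instead of creating a single split it generates all possible combinations, which is not what I want right now.
Another approach is to compute a hash value from the person_id field and use it for sampling: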
import numpy as np

salt = str(np.random.rand())  # randomness source
hash_values = tbl['person_id'].apply(lambda p: hash(salt + p) % 100)
# 50/50 split
sel_training = hash_values < 50
training_set = tbl.loc[sel_training]
testing_set = tbl.loc[~sel_training]  # ~ inverts a boolean mask; unary minus does not
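One caveat with the hashing approach: Python's built-in hash() for strings is randomized per interpreter process by default (unless PYTHONHASHSEED is set), so reusing the same salt does not reproduce the same split in a later run. A minimal sketch of a reproducible variant follows. It assumes hashlib.md5 as the hash function; the helper name stable_bucket and the salt value "split-v1" are hypothetical and only for illustration.

```python
import hashlib

import pandas as pd


def stable_bucket(person_id, salt="split-v1", n_buckets=100):
    """Map an ID to a bucket in [0, n_buckets) that is identical across runs."""
    digest = hashlib.md5((salt + str(person_id)).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


tbl = pd.DataFrame({"person_id": list("aaabbcccdeeefffhiijj")})
buckets = tbl["person_id"].apply(stable_bucket)

# Every row of a person gets the same bucket, so a person never straddles the split.
sel_training = buckets < 50
training_set = tbl.loc[sel_training]
testing_set = tbl.loc[~sel_training]
```

Changing the salt gives a different but equally repeatable assignment. With only a handful of people, the split can be far from 50/50, since whole people are assigned at once.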
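Is there a more elegant way to accomplish this?
Answer 0 (score: 2)
I started writing my own cross-validation class to do what you describe. Here is the code (sorry, it is not very clean).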
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold  # formerly sklearn.cross_validation


class StratifiedKFold_ByColumn(object):
    """Stratified K-fold over groups: rows sharing a value of `colname` stay in one fold.

    Assumes `y` is a binary 0/1 Series in which zeros outnumber ones.
    """

    def __init__(self, n_folds, X, y, colname):
        # One label per group: a group is positive if any of its rows is positive.
        groupable = pd.concat([X[colname], y], axis=1)
        grouped = groupable.groupby(colname).max()
        self.column = X[colname]
        self.colname = colname
        group_ids = grouped.index.values
        group_labels = grouped.values[:, 0]
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True)
        # The splitter yields positions; map them back to group identifiers.
        self.folds = [
            (group_ids[train], group_ids[val])
            for train, val in skf.split(group_ids, group_labels)
        ]
        self.n_folds = n_folds
        self.i = 0
        self.y = y

    def __len__(self):
        return self.n_folds

    def __iter__(self):
        self.i = 0
        return self

    def test(self):
        # Print the positive/negative row counts of every fold.
        for train, val in self.folds:
            train_mask = self.column.isin(train)
            val_mask = self.column.isin(val)
            print('train:', self.y[train_mask].sum(), (1 - self.y[train_mask]).sum())
            print('val:', self.y[val_mask].sum(), (1 - self.y[val_mask]).sum())

    def __next__(self):
        if self.i >= self.n_folds:
            raise StopIteration()
        train, val = self.folds[self.i]
        self.i += 1
        train_mask = self.column.isin(train)
        val_mask = self.column.isin(val)
        y_train = self.y[train_mask]
        X_train = self.column[train_mask]
        n_tr_1 = int((y_train != 0).sum())
        n_tr_0 = int((y_train == 0).sum())
        assert n_tr_1 < n_tr_0
        stride = n_tr_0 // n_tr_1  # integer division: negatives per positive
        X_train_1 = X_train[y_train != 0]
        X_train_0 = X_train[y_train == 0]
        # Interleave each positive row with `stride` negative rows.
        train_idxs = []
        for i_1 in range(n_tr_1):
            train_idxs.extend(X_train_1.iloc[i_1:i_1 + 1].index)
            train_idxs.extend(X_train_0.iloc[i_1 * stride:(i_1 + 1) * stride].index)
        val_idxs = val_mask[val_mask].index
        # Both arrays hold index labels of X, not positions.
        return np.array(train_idxs), np.array(val_idxs)

    next = __next__  # Python 2 compatibility
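If a single group-aware split is all that is needed, a custom class may not be necessary: newer scikit-learn releases ship GroupShuffleSplit in sklearn.model_selection. The following is a minimal sketch on the question's table, assuming a scikit-learn version that provides sklearn.model_selection; the test_size and random_state values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

tbl = pd.DataFrame(dict(
    person_id=list('aaabbcccdeeefffhiijj'),
    random_variable=np.linspace(0, 1, 20),
    dependent_variable=np.arange(20),
))

# test_size is a fraction of the groups (people), not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_pos, test_pos = next(splitter.split(tbl, groups=tbl['person_id']))

# split() yields positional indices, hence iloc rather than loc.
training_set = tbl.iloc[train_pos]
testing_set = tbl.iloc[test_pos]
```

Unlike the class above, this does not stratify on the target or rebalance the classes; it only guarantees that no person_id appears on both sides of the split.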