Stratified K-fold splitting with replicates in Python

Time: 2019-08-19 21:05:19

Tags: python cross-validation

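I have a dataset made up of several test systems, and replicate tests have been run for each system.

The total number of replicates is the same for every system (balanced), but each replicate may contain a different total number of observations.

I would like to use cross-validation to split the data into training and test sets such that:

  1. Every system is represented in both the test and training sets.
  2. The training set contains all but one of the replicates for each system, and the test set contains the remaining replicate for each system.
  3. The percentage of observations from each system in the test set matches the percentage of observations from each system in the training set.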
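The three requirements can be checked mechanically for any candidate split. The helper below is a minimal sketch, not part of scikit-learn: the name `check_split`, its return format, and the toy two-system data are all assumptions made for illustration.

```python
import numpy as np

def check_split(labels, reps, train_idx, test_idx):
    """Hypothetical helper: report how one split fares against the three requirements."""
    labels = np.asarray(labels)
    reps = np.asarray(reps)
    systems = np.unique(labels)
    train_labels, test_labels = labels[train_idx], labels[test_idx]
    train_reps, test_reps = reps[train_idx], reps[test_idx]

    report = {}
    # Requirement 1: every system appears in both sets
    report['all_systems_present'] = (
        set(train_labels) == set(systems) and set(test_labels) == set(systems))

    # Requirement 2: per system, the test set holds exactly one replicate
    # and the training set holds all the others
    ok = True
    for s in systems:
        all_r = set(reps[labels == s])
        tr_r = set(train_reps[train_labels == s])
        te_r = set(test_reps[test_labels == s])
        ok = ok and (len(te_r) == 1 and tr_r == all_r - te_r)
    report['one_replicate_held_out'] = ok

    # Requirement 3: share of each system's observations within each set
    report['train_share'] = {str(s): float(np.mean(train_labels == s)) for s in systems}
    report['test_share'] = {str(s): float(np.mean(test_labels == s)) for s in systems}
    return report

# Toy data (assumed): two systems, two replicates each, unequal replicate sizes
labels = np.array(['A'] * 6 + ['B'] * 6)
reps = np.array(['r0'] * 4 + ['r1'] * 2 + ['r0'] * 3 + ['r1'] * 3)
test_idx = np.flatnonzero(reps == 'r1')   # hold out r1 from both systems
train_idx = np.flatnonzero(reps == 'r0')
report = check_split(labels, reps, train_idx, test_idx)
print(report)
```

In this toy split the first two requirements hold, while the per-system shares differ between the two sets, which is the situation requirement 3 is meant to rule out.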

I had hoped to use scikit-learn's StratifiedKFold class, but it does not seem to do quite what I need.

For example, using the following sample label data:

import numpy as np

labels=np.concatenate([['Sys1']*35,['Sys2']*33,['Sys3']*36])
reps=np.concatenate([
    np.concatenate([
        ['Rep_0']*10,['Rep_1']*10,['Rep_2']*5,['Rep_3']*10]),
    np.concatenate([
        ['Rep_0']*8,['Rep_1']*10,['Rep_2']*10,['Rep_3']*5]),
    np.concatenate([
        ['Rep_0']*10,['Rep_1']*7,['Rep_2']*9,['Rep_3']*10])
])
frames=np.concatenate([
    np.concatenate([
        np.arange(10),np.arange(10),np.arange(5),np.arange(10)]),
    np.concatenate([
        np.arange(8),np.arange(10),np.arange(10),np.arange(5)]),
    np.concatenate([
        np.arange(10),np.arange(7),np.arange(9),np.arange(10)])
])
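# combined "<system>.<replicate>" key for every observation
# (a list comprehension behaves the same on Python 2 and Python 3)
sampleKeys=np.array(['.'.join([x,y]) for x,y in zip(labels,reps)])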

I tried splitting on the labels:

import sklearn as skl
import sklearn.model_selection  # loads the submodule so skl.model_selection resolves

cvSplitter=skl.model_selection.StratifiedKFold(n_splits=4)
iSplit=0
for train_indices, test_indices in cvSplitter.split(labels,labels):
    print('--- split %g ---'%iSplit)
    print('TRAIN:')
    # one row per system.replicate key: [key, number of observations]
    for key,count in zip(*np.unique(sampleKeys[train_indices],return_counts=True)):
        print(np.array([key,str(count)]))

    print('TEST:')
    for key,count in zip(*np.unique(sampleKeys[test_indices],return_counts=True)):
        print(np.array([key,str(count)]))

    iSplit=iSplit+1

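However, although the resulting splits have equal percentages of each system's observations in training and test, the training set includes all the replicates of some or all systems. What I want is for the training set to include all but one replicate of each system, and for the test set to contain the missing replicate.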

--- split 0 ---
TRAIN:
['Sys1.Rep_0' '1']
['Sys1.Rep_1' '10']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '10']
['Sys2.Rep_1' '9']
['Sys2.Rep_2' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '1']
['Sys3.Rep_1' '7']
['Sys3.Rep_2' '9']
['Sys3.Rep_3' '10']
TEST:
['Sys1.Rep_0' '9']
['Sys2.Rep_0' '8']
['Sys2.Rep_1' '1']
['Sys3.Rep_0' '9']
--- split 1 ---
TRAIN:
['Sys1.Rep_0' '9']
['Sys1.Rep_1' '2']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '10']
['Sys2.Rep_0' '8']
['Sys2.Rep_1' '2']
['Sys2.Rep_2' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '9']
['Sys3.Rep_2' '8']
['Sys3.Rep_3' '10']
TEST:
['Sys1.Rep_0' '1']
['Sys1.Rep_1' '8']
['Sys2.Rep_1' '8']
['Sys3.Rep_0' '1']
['Sys3.Rep_1' '7']
['Sys3.Rep_2' '1']
--- split 2 ---
TRAIN:
['Sys1.Rep_0' '10']
['Sys1.Rep_1' '8']
['Sys1.Rep_3' '8']
['Sys2.Rep_0' '8']
['Sys2.Rep_1' '9']
['Sys2.Rep_2' '3']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '10']
['Sys3.Rep_1' '7']
['Sys3.Rep_2' '1']
['Sys3.Rep_3' '9']
TEST:
['Sys1.Rep_1' '2']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '2']
['Sys2.Rep_1' '1']
['Sys2.Rep_2' '7']
['Sys3.Rep_2' '8']
['Sys3.Rep_3' '1']
--- split 3 ---
TRAIN:
['Sys1.Rep_0' '10']
['Sys1.Rep_1' '10']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '2']
['Sys2.Rep_0' '8']
['Sys2.Rep_1' '10']
['Sys2.Rep_2' '7']
['Sys3.Rep_0' '10']
['Sys3.Rep_1' '7']
['Sys3.Rep_2' '9']
['Sys3.Rep_3' '1']
TEST:
['Sys1.Rep_3' '8']
['Sys2.Rep_2' '3']
['Sys2.Rep_3' '5']
['Sys3.Rep_3' '9']

If I split on 'reps' instead, I end up with some systems being left out of the test and/or training data.
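On splitting by replicate: a group-aware splitter behaves differently from a stratified one here. The sketch below uses scikit-learn's `LeaveOneGroupOut` with `reps` as the groups, assuming it is acceptable to hold out the same replicate name from every system in a given fold (the question does not state this). The `rep_counts` table and `X_placeholder` array are illustrative names; the counts are taken from the example data above.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Same label layout as the example data above
labels = np.concatenate([['Sys1'] * 35, ['Sys2'] * 33, ['Sys3'] * 36])
rep_counts = [[10, 10, 5, 10], [8, 10, 10, 5], [10, 7, 9, 10]]
reps = np.concatenate([['Rep_%d' % i] * n
                       for counts in rep_counts
                       for i, n in enumerate(counts)])

# Only the number of rows of X matters to the splitter
X_placeholder = np.zeros((len(labels), 1))

splitter = LeaveOneGroupOut()
splits = list(splitter.split(X_placeholder, labels, groups=reps))

for i, (train_idx, test_idx) in enumerate(splits):
    held_out = np.unique(reps[test_idx])
    print('--- split %d, held-out replicate(s): %s ---' % (i, held_out))
    for s in np.unique(labels):
        # share of this system's observations within each set
        train_share = np.mean(labels[train_idx] == s)
        test_share = np.mean(labels[test_idx] == s)
        print('%s  train share %.3f  test share %.3f' % (s, train_share, test_share))
```

Each fold keeps every system in both sets and holds out one whole replicate per system, but it does nothing to equalize the per-system shares, so requirement 3 is still open.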

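1 Answer:

Answer 0 (score: 0)

So far I have managed to write a function that iterates over the replicates so that the training data contains all but one replicate of each system, and the test data contains the replicate that was left out of the training set (for now the function just spits out the required indices)... I am a bit stumped on how to make sure the data are also stratified by the number of samples in each system.

Using the original test example: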

import numpy as np

labels=np.concatenate([['Sys1']*35,['Sys2']*33,['Sys3']*36])
reps=np.concatenate([
    np.concatenate([
        ['Rep_0']*10,['Rep_1']*10,['Rep_2']*5,['Rep_3']*10]),
    np.concatenate([
        ['Rep_0']*8,['Rep_1']*10,['Rep_2']*10,['Rep_3']*5]),
    np.concatenate([
        ['Rep_0']*10,['Rep_1']*7,['Rep_2']*9,['Rep_3']*10])
])
frames=np.concatenate([
    np.concatenate([
        np.arange(10),np.arange(10),np.arange(5),np.arange(10)]),
    np.concatenate([
        np.arange(8),np.arange(10),np.arange(10),np.arange(5)]),
    np.concatenate([
        np.arange(10),np.arange(7),np.arange(9),np.arange(10)])
])
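# combined "<system>.<replicate>" key for every observation
# (a list comprehension behaves the same on Python 2 and Python 3)
sampleKeys=np.array(['.'.join([x,y]) for x,y in zip(labels,reps)])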

and the split function:

def stratifiedReplicaSplit(sysLabels,repLabels,nSplits):
    #key every observation by "<system>.<replicate>"
    sampleLabels=np.array(['.'.join([str(x),str(y)])
                           for x,y in zip(sysLabels,repLabels)])
    #np.unique already returns its values sorted
    sysTypes=np.unique(sysLabels)
    repTypes=np.unique(repLabels)
    nSysLabels=len(sysTypes)
    nRepLabels=len(repTypes)

    outInds=[]

    #generate a list of all possible combinations for having a single
    #rep from each system.
    #we will shamelessly hack np.meshgrid to achieve this.
    tempRepSets=[np.arange(nRepLabels)]*nSysLabels
    comboSetGrids=np.array([tempGrid.ravel() for tempGrid in np.meshgrid(*tempRepSets)])
    #we can now generate a unique split by sampling from the total set of all
    #possible combo sets (without replacement)
    comboInds=np.random.choice(comboSetGrids.shape[1],nSplits,replace=False)
    combosArray=comboSetGrids[:,comboInds].T
    print('sysTypes', sysTypes)
    print('repTypes', repTypes)
    print(combosArray)
    for iSplit in range(nSplits):
        comboSet=combosArray[iSplit].flatten()
        print(comboSet)
        #held-out keys for this split: one replicate from each system
        testKeys=['%s.%s'%(sysTypes[iEntry],repTypes[entry])
                  for iEntry,entry in enumerate(comboSet)]
        isTest=np.isin(sampleLabels,testKeys)
        #each entry is [train indices, test indices]
        outInds.append([np.flatnonzero(~isTest),np.flatnonzero(isTest)])
    return outInds

I then run:

cvSplitter=stratifiedReplicaSplit
iSplit=0
for train_indices, test_indices in cvSplitter(labels,reps,4):
    print('--- split %g ---'%iSplit)
    print('TRAIN:')
    # one row per system.replicate key: [key, number of observations]
    for key,count in zip(*np.unique(sampleKeys[train_indices],return_counts=True)):
        print(np.array([key,str(count)]))

    print('TEST:')
    for key,count in zip(*np.unique(sampleKeys[test_indices],return_counts=True)):
        print(np.array([key,str(count)]))

    iSplit=iSplit+1

and get:

sysTypes ['Sys1' 'Sys2' 'Sys3']
repTypes ['Rep_0' 'Rep_1' 'Rep_2' 'Rep_3']
[[2 0 1]
 [0 1 2]
 [3 0 1]
 [0 2 3]]
[2 0 1]
[0 1 2]
[3 0 1]
[0 2 3]
--- split 0 ---
TRAIN:
['Sys1.Rep_0' '10']
['Sys1.Rep_1' '10']
['Sys1.Rep_3' '10']
['Sys2.Rep_1' '10']
['Sys2.Rep_2' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '10']
['Sys3.Rep_2' '9']
['Sys3.Rep_3' '10']
TEST:
['Sys1.Rep_2' '5']
['Sys2.Rep_0' '8']
['Sys3.Rep_1' '7']
--- split 1 ---
TRAIN:
['Sys1.Rep_1' '10']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '10']
['Sys2.Rep_0' '8']
['Sys2.Rep_2' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '10']
['Sys3.Rep_1' '7']
['Sys3.Rep_3' '10']
TEST:
['Sys1.Rep_0' '10']
['Sys2.Rep_1' '10']
['Sys3.Rep_2' '9']
--- split 2 ---
TRAIN:
['Sys1.Rep_0' '10']
['Sys1.Rep_1' '10']
['Sys1.Rep_2' '5']
['Sys2.Rep_1' '10']
['Sys2.Rep_2' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '10']
['Sys3.Rep_2' '9']
['Sys3.Rep_3' '10']
TEST:
['Sys1.Rep_3' '10']
['Sys2.Rep_0' '8']
['Sys3.Rep_1' '7']
--- split 3 ---
TRAIN:
['Sys1.Rep_1' '10']
['Sys1.Rep_2' '5']
['Sys1.Rep_3' '10']
['Sys2.Rep_0' '8']
['Sys2.Rep_1' '10']
['Sys2.Rep_3' '5']
['Sys3.Rep_0' '10']
['Sys3.Rep_1' '7']
['Sys3.Rep_2' '9']
TEST:
['Sys1.Rep_0' '10']
['Sys2.Rep_2' '10']
['Sys3.Rep_3' '10']

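...so in essence I have reproduced a variant of the 'LeaveOneGroupOut' split (more specifically, one that leaves out a single combination of system/replicate pairs). This works, but the relative sample size of each system in the test data is not proportional to that system's relative sample size in the training data... I need to stratify/balance this somehow. E.g. the percentage of a system's samples in the training set should match the percentage of that system's samples in the test set...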
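One possible way to approach the balancing step, offered only as a sketch, is to randomly downsample the held-out observations until each system's share of the test set roughly matches its share of the training set. The function name `downsample_test_to_match`, the `seed` parameter, and the choice to discard test data rather than training data are all assumptions, not something the post settles on. The example reuses split 0 from the output above (Sys1.Rep_2, Sys2.Rep_0, Sys3.Rep_1 held out).

```python
import numpy as np

def downsample_test_to_match(labels, train_idx, test_idx, seed=0):
    """Hypothetical helper: drop test observations so that per-system shares
    in the test set approximate those in the training set."""
    labels = np.asarray(labels)
    train_idx = np.asarray(train_idx)
    test_idx = np.asarray(test_idx)
    rng = np.random.default_rng(seed)

    systems = np.unique(labels[train_idx])
    train_share = np.array([np.mean(labels[train_idx] == s) for s in systems])
    test_counts = np.array([np.sum(labels[test_idx] == s) for s in systems])

    # Largest test-set size at which no system needs more samples than it has
    total = np.min(test_counts / train_share)
    # Small epsilon guards against floating-point error just below an integer
    keep_counts = np.floor(total * train_share + 1e-9).astype(int)

    kept = []
    for s, k in zip(systems, keep_counts):
        candidates = test_idx[labels[test_idx] == s]
        kept.append(rng.choice(candidates, size=k, replace=False))
    return np.sort(np.concatenate(kept))

# Example data from the post
labels = np.concatenate([['Sys1'] * 35, ['Sys2'] * 33, ['Sys3'] * 36])
rep_counts = [[10, 10, 5, 10], [8, 10, 10, 5], [10, 7, 9, 10]]
reps = np.concatenate([['Rep_%d' % i] * n
                       for counts in rep_counts
                       for i, n in enumerate(counts)])

# Split 0 from the output above: one held-out replicate per system
held_out = {'Sys1': 'Rep_2', 'Sys2': 'Rep_0', 'Sys3': 'Rep_1'}
is_test = np.array([held_out[s] == r for s, r in zip(labels, reps)])
train_idx = np.flatnonzero(~is_test)
test_idx = np.flatnonzero(is_test)

balanced_test = downsample_test_to_match(labels, train_idx, test_idx)
for s in np.unique(labels):
    print('%s  train share %.3f  test share %.3f' % (
        s, np.mean(labels[train_idx] == s), np.mean(labels[balanced_test] == s)))
```

Because the kept counts are rounded down to whole observations, the shares only match approximately, and the approach throws away test data. Downsampling the training set instead, or weighting samples, would be alternatives with different trade-offs.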