Question

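In numpy I have a dataset like the one below. The first two columns are indices. The dataset can be divided into blocks by these indices: the first block is 0 0, the second is 0 1, the third is 0 2, then come 1 0, 1 1, 1 2, and so on. Each block has at least two elements. The numbers in the index columns can vary.

I need to split the dataset 80%-20% at random along these blocks, such that after the split each block has at least 1 element in both resulting datasets. How can I do that?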

indices | real data
        |
0   0   | 43.25 665.32 ...  } 1st block
0   0   | 11.234            }
0   1     ...               } 2nd block
0   1                       } 
0   2                       } 3rd block
0   2                       }
1   0                       } 4th block
1   0                       }
1   0                       }
1   1                       ...
1   1                       
1   2
1   2
2   0
2   0 
2   1
2   1
2   1
...
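
For concreteness, here is a minimal sketch of how such blocks could be labelled, assuming the data is a plain 2-D array whose first two columns are the indices. The toy values and the names `data`, `block_id` and `block_size` are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: two index columns followed by one data column.
data = np.array([
    [0, 0, 43.25],
    [0, 0, 11.234],
    [0, 1, 5.0],
    [0, 1, 6.0],
    [1, 0, 7.0],
    [1, 0, 8.0],
    [1, 0, 9.0],
])

# Each distinct (index1, index2) pair is one block.
blocks, block_id, block_size = np.unique(
    data[:, :2], axis=0, return_inverse=True, return_counts=True
)
block_id = block_id.ravel()  # guard: some NumPy versions return a 2-D inverse

print(blocks)      # the distinct index pairs
print(block_id)    # block number of every row
print(block_size)  # number of rows in every block
```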

Answer 1

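See how you like this. To introduce randomness, I shuffle the entire dataset. It is the only way I have found to vectorize the split. Maybe you could simply shuffle an indexing array instead, but that was one indirection too many for my brain today. I have also used a structured array, to make extracting the blocks easier. First, let's create a sample dataset: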

from __future__ import division
import numpy as np

# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)

items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)

# np.int and np.float were removed from NumPy; the builtins work everywhere
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                             ('data', float)])
dataset['idx1'][:2*c1*c2] =  np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3

Now the stratified sampling:

# For randomness, shuffle the entire array
np.random.shuffle(dataset)

blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)

with np.errstate(divide='ignore'): # n == 2 divides by zero and yields inf
    x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), x is larger than 1, so clip it
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B

a_idx = threshold > np.random.rand(len(dataset))

A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
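
The formula for x in the comments can be checked with a short derivation, where n is the block size and x is the probability that one of the remaining n - 2 items goes to the first array:

```latex
\frac{x(n-2) + 1}{(1-x)(n-2) + 1} = 4
\;\Longrightarrow\; x(n-2) + 1 = 4(1-x)(n-2) + 4
\;\Longrightarrow\; 5x(n-2) = 4(n-2) + 3
\;\Longrightarrow\; x = \frac{4}{5} + \frac{3}{5(n-2)}
```

For n = 2 there are no remaining items, and for n = 3 the formula gives x = 1.4, which is why x is clipped to the interval [0, 1].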

After running the code, the split is roughly 80/20, and every block is represented in both arrays:

>>> len(A)
815
>>> len(B)
190
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
True
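
The answer mentions that shuffling an indexing array might work in place of shuffling the dataset. A minimal sketch of that variant follows. It assumes the rows already carry integer block labels 0..k-1 and that every label occurs at least twice. The function name `split_blocks` and the toy labels are hypothetical, and the 80/20 ratio is hard-coded as in the answer.

```python
import numpy as np

def split_blocks(block_id, rng=None):
    """Return a boolean mask over rows: True -> ~80% part, False -> ~20% part.

    Assumes block_id holds integer labels 0..k-1 and that every label
    occurs at least twice.
    """
    rng = np.random.default_rng() if rng is None else rng
    block_id = np.asarray(block_id)
    counts = np.bincount(block_id)
    start = np.concatenate(([0], np.cumsum(counts)[:-1]))

    # Shuffle row positions only, then group them by block (stable sort
    # keeps the random order inside each block).
    perm = rng.permutation(len(block_id))
    order = perm[np.argsort(block_id[perm], kind="stable")]

    # Same threshold formula as in the answer, 80/20 hard-coded.
    with np.errstate(divide="ignore"):
        x = 4 / 5 + 3 / 5 / (counts - 2)
    x = np.clip(x, 0, 1)
    threshold = np.repeat(x, counts)
    threshold[start] = 1      # first row of each block -> True part
    threshold[start + 1] = 0  # second row of each block -> False part

    in_first_sorted = threshold > rng.random(len(block_id))
    in_first = np.empty(len(block_id), dtype=bool)
    in_first[order] = in_first_sorted  # map back to original row order
    return in_first

# Hypothetical usage: 50 blocks of 20 rows each.
rng = np.random.default_rng(0)
block_id = np.repeat(np.arange(50), 20)
in_first = split_blocks(block_id, rng=rng)
print(in_first.mean())  # fraction of rows in the larger part
```

With such a mask, `dataset[in_first]` and `dataset[~in_first]` would give the two parts without reordering the original data.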

Answer 2

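Here is an alternative solution. I am open to code review if this can be done in a more concise way (without for loops). @Jamie's answer is very good, but it sometimes produces skewed ratios within individual blocks.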

    import numpy as np

    # data is assumed to be a 2-D array whose first two columns are the indices
    ratio = 0.8
    IDX1 = 0
    IDX2 = 1
    # Iterate over the index values that actually occur in the data
    idx1s = np.unique(data[:, IDX1])
    idx2s = np.unique(data[:, IDX2])
    train_parts = []
    valid_parts = []
    for i1 in idx1s:
        for i2 in idx2s:
            # Row positions that belong to block (i1, i2)
            rows = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))[0]
            if len(rows) == 0:
                continue
            # Shuffle the row positions so the split inside the block is random
            np.random.shuffle(rows)
            # For blocks with at least two rows this leaves at least one row
            # on each side of the split
            n_train = int(np.around((len(rows) - 1) * ratio))
            train_parts.append(data[rows[:n_train]])
            valid_parts.append(data[rows[n_train:]])
    train = np.vstack(train_parts)
    valid = np.vstack(valid_parts)
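
Since the answer asks whether the loops can be avoided, here is one possible loop-free sketch. It is not the author's code: it ranks rows by a random key inside each block and keeps the first rows of each block for training. The function name `split_by_rank` and its parameters are hypothetical, and it rounds n * ratio (clamped to 1..n-1), which can differ slightly from the rounding used in the loop version.

```python
import numpy as np

def split_by_rank(data, idx_cols=(0, 1), ratio=0.8, rng=None):
    """Split a 2-D array per block; assumes every block has >= 2 rows."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    _, block_id, counts = np.unique(
        data[:, list(idx_cols)], axis=0,
        return_inverse=True, return_counts=True,
    )
    block_id = block_id.ravel()  # guard against a 2-D inverse array

    # Sort rows by block, with a random order inside each block
    # (the last key passed to lexsort is the primary one).
    order = np.lexsort((rng.random(len(data)), block_id))
    start = np.concatenate(([0], np.cumsum(counts)[:-1]))
    # Position of each sorted row inside its own block: 0, 1, 2, ...
    rank = np.arange(len(data)) - np.repeat(start, counts)
    # Training rows per block: near ratio * n, at least 1, at most n - 1.
    n_train = np.clip(np.rint(counts * ratio).astype(int), 1, counts - 1)

    is_train_sorted = rank < np.repeat(n_train, counts)
    is_train = np.empty(len(data), dtype=bool)
    is_train[order] = is_train_sorted  # back to the original row order
    return data[is_train], data[~is_train]

# Hypothetical usage: three blocks with 2, 5 and 10 rows.
rng = np.random.default_rng(0)
idx = np.repeat([[0, 0], [0, 1], [1, 0]], [2, 5, 10], axis=0)
data = np.column_stack((idx, rng.random(len(idx))))
train, valid = split_by_rank(data, rng=rng)
print(len(train), len(valid))  # 13 4
```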

Answer 3

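I am assuming that each block has at least two entries, and that if it has more than two you want them split as close to 80/20 as possible. The easiest way to do this seems to be to assign a random number to every row and then select rows based on percentiles within each stratum. Say this is the data in the file strat_sample.csv: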

Index_1,Index_2,Data_1,Data_2
0,0,0.614583182,0.677644482
0,0,0.321384981,0.598450854
0,0,0.303029607,0.300593782
0,0,0.646010758,0.612006715
0,0,0.484572883,0.30052535
0,1,0.010625416,0.118671475
0,1,0.428967984,0.23795173
0,1,0.523440618,0.457275922
0,1,0.379612652,0.337640868
0,1,0.338180659,0.206399031
1,0,0.079386,0.890939911
1,0,0.572864624,0.725615079
1,0,0.045891404,0.300128917
1,0,0.578792198,0.100698871
1,0,0.776485138,0.475135948
1,0,0.401850419,0.784835723
1,1,0.087660923,0.497299605
1,1,0.8460978,0.825774802
1,1,0.526015021,0.581905971
1,1,0.23324672,0.299475291

Then this code (using Pandas data structures) works as required:

import numpy as np
import pandas as pd
from math import ceil, floor
# sample data: strat_sample.csv (contents shown above)

def TreatmentOneCount(n , *args):
    #assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1. 
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        """
        There are one of two numbers that are close to the target ratio, one above, the other below
        If the number above is N and it is closest to optimal, then you need to set things to N-1 to ensure both groups have at least one member (recall n>2)
        If the number below is 0 and it is closest to optimal, then you need to set things to 1 to ensure both groups have at least one member (recall n>2)
        """
        targetassigment = OptimalRatio * n
        if  targetassigment - floor(targetassigment) > 0.5:
            a = min(ceil(targetassigment),n-1)
        else:
            a = max(floor(targetassigment),1)
    return a


df = pd.read_csv('strat_sample.csv', sep=','  , header=0)

#assign a random number to each entry
df['RandScore'] =  np.random.uniform(0,1,df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)

#Within each block assign a rank based on random number. 
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()

#make a group index
# the separator keeps pairs such as (1, 11) and (11, 1) distinct
df['MasterIdx'] = df['Index_1'].apply(str) + '_' + df['Index_2'].apply(str)

#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)

#Add the block counts to the data
df = df.merge(dftest, how='left',  left_on = 'MasterIdx', right_index= True)

#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <=  df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
# drop the helper columns; drop() returns a new frame, so keep the result
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
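
To see what the rounding rule does for small blocks, here is a standalone sketch of the same logic in plain Python. The name `treatment_one_count` is a hypothetical restatement of `TreatmentOneCount`, and unlike the original it raises an error for n < 2 instead of returning NaN.

```python
from math import ceil, floor

def treatment_one_count(n, ratio):
    """Number of rows of an n-row block assigned to group 1."""
    if n < 2:
        raise ValueError("a block needs at least two rows")
    target = ratio * n
    # Round to the nearer integer (exact halves round down).
    if target - floor(target) > 0.5:
        count = ceil(target)
    else:
        count = floor(target)
    # Keep at least one row in each group.
    return min(max(count, 1), n - 1)

# Group sizes for block sizes 2..10 at an 80/20 target.
for n in range(2, 11):
    k = treatment_one_count(n, 0.8)
    print(n, k, n - k)
```

For blocks of 2 to 10 rows this assigns 1, 2, 3, 4, 5, 6, 6, 7 and 8 rows to group 1, so the smallest blocks deviate from 80/20 only as much as the one-row-per-group constraint requires.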

Answer 4

# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Stratified sampling in numpy
