Question

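I have a dataset with 100,000 samples and 2 target classes {Pass, Fail}. My main goal is to randomly pick 5 chunks (groups) from the dataset, where each chunk consists of 5 consecutive samples whose target is "Fail".

My dataset: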

i           target             value
0            Fail               12
1            Fail               12
2            Fail               14
3            Fail               13
4            Fail               8 
5            Pass               40
6            Fail               12
7            Fail               7
8            Fail               9
9            Fail               11
10           Fail               19
11           Pass               44
12           Fail               16
13           Fail               4
.........................................
n

For the dataset above,

List1 = [0,1,2,3,4]

List2 = [6,7,8,9,10]

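would be valid results; however, the chunks should be chosen at random from the whole dataset.

P.S.: The dataset is stored in an Excel sheet and imported with pandas.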

Answer 1

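I would first identify all chunks of an acceptable size, then pick at random from that list. The code below assumes a simple numeric index in the form of a RangeIndex (numbered from 0 to len-1). If your index is different, use reset_index to get a clean RangeIndex.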

import numpy as np
import pandas as pd
s = pd.Series(np.where(df2.target=='Fail', 1, np.nan), index=df2.index)
ends = np.random.choice(s[s.rolling(5).count()==5].index.values, 5)

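ends contains the end indices of 5 randomly chosen windows of 5 consecutive df2 rows that are all Fail; each window covers rows end-4 through end. Note that np.random.choice samples with replacement by default, so the same end index can be drawn more than once; pass replace=False to avoid duplicates.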
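To make the rolling-window idea concrete, here is a minimal sketch on the 14 sample rows from the question. The helper names `window_ends` and `ends_to_chunks` are hypothetical (they are not part of the answer), and the sketch assumes a RangeIndex, as the answer does.

```python
import numpy as np
import pandas as pd

# Rows 0-13 copied from the question's sample data.
df2 = pd.DataFrame({
    "target": ["Fail"] * 5 + ["Pass"] + ["Fail"] * 5 + ["Pass"] + ["Fail"] * 2,
    "value": [12, 12, 14, 13, 8, 40, 12, 7, 9, 11, 19, 44, 16, 4],
})
# Make sure the index is a clean 0..len-1 RangeIndex.
df2 = df2.reset_index(drop=True)

def window_ends(df, size=5):
    """Index labels at which a window of `size` consecutive 'Fail' rows ends."""
    # 1.0 for Fail, NaN otherwise, so count() only counts Fail rows.
    s = pd.Series(np.where(df["target"] == "Fail", 1.0, np.nan), index=df.index)
    full = s.rolling(size).count() == size
    return s[full].index.to_numpy()

def ends_to_chunks(ends, size=5):
    """Expand each end index into the list of row positions it covers."""
    return [list(range(int(e) - size + 1, int(e) + 1)) for e in ends]

candidates = window_ends(df2)

# Seeded generator so the draw is repeatable; replace=False avoids duplicate ends.
rng = np.random.default_rng(0)
ends = rng.choice(candidates, size=2, replace=False)
chunks = ends_to_chunks(ends)
rows = [df2.loc[c] for c in chunks]  # the actual DataFrame rows of each chunk
```

On this sample only two windows qualify, so both are always drawn. On longer runs of Fail rows, neighbouring end indices describe overlapping windows, so an extra filtering step would be needed if the chunks must be disjoint.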

Answer 2

You can define a function that randomly picks n chunks of 5 consecutive numbers from a numpy array (if that many exist).

import numpy as np

def get_chunks(x, n):
    chunks = np.split(x, np.where(np.diff(x) != 1)[0]+1)  # split into runs of consecutive values
    chunks = [c for c in chunks if len(c) >= 5]  # keep only runs with at least 5 elements
    if len(chunks) >= n:
        n_chunks = [chunks[i] for i in np.random.choice(range(len(chunks)), n, replace=False)]  # choose n runs
        rs = [np.random.choice(np.arange(0, len(chunk) - 4)) for chunk in n_chunks]  # random start offset within each chosen run
        return [n_chunks[i][rs[i]: rs[i]+5] for i in range(len(n_chunks))]  # take 5 elements from each chosen run
    else:
        return None

Then apply it to the index of your DataFrame. With your example, one possible result is:

In [1]: indices = df.reset_index().groupby('target')['index'].apply(np.array)['Fail']
        get_chunks(indices, 2)
Out[1]: [array([ 6,  7,  8,  9, 10]), array([0, 1, 2, 3, 4])]
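The intermediate steps of this approach can be traced on the sample rows from the question. The following is only a sketch: it uses `np.flatnonzero` in place of the groupby line to get the positions of the Fail rows, a seeded `np.random.default_rng` for repeatability, and variable names (`runs`, `long_runs`, `picked`) that are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Rows 0-13 copied from the question's sample data.
df = pd.DataFrame({
    "target": ["Fail"] * 5 + ["Pass"] + ["Fail"] * 5 + ["Pass"] + ["Fail"] * 2,
    "value": [12, 12, 14, 13, 8, 40, 12, 7, 9, 11, 19, 44, 16, 4],
})

# Positions of all 'Fail' rows.
indices = np.flatnonzero(df["target"].to_numpy() == "Fail")

# Split wherever two neighbouring positions are not consecutive.
runs = np.split(indices, np.where(np.diff(indices) != 1)[0] + 1)

# Only runs with at least 5 rows can hold a 5-row chunk.
long_runs = [r for r in runs if len(r) >= 5]

# Pick a random 5-row window inside each long run and fetch those rows.
rng = np.random.default_rng(42)
picked = []
for run in long_runs:
    start = rng.integers(0, len(run) - 4)  # upper bound is exclusive
    picked.append(df.iloc[run[start:start + 5]])
```

Here `runs` holds three groups (rows 0-4, 6-10 and 12-13), and only the first two are long enough to be sampled from.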

How to extract n chunks from a dataset?
