Question

Consider a dataset of N rows, each with a weight. This is the basic algorithm:

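1. Normalize the weights so that they sum to 1.
2. Copy the weights to another column to record the sampling probabilities.
3. Randomly select 1 row (without replacement) according to the sampling probabilities, and add it to the sample dataset.
4. Remove the drawn row from the original dataset, and recompute the sampling probabilities by normalizing the weights of the remaining rows.
5. Repeat steps 3 and 4 until the sum of the weights in the sample reaches or exceeds a threshold (say 0.6).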

Here is a toy example:

import pandas as pd
import numpy as np

def sampler(n):
    df = pd.DataFrame(np.random.rand(n), columns=['weight'])
    df['weight'] = df['weight']/df['weight'].sum()
    df['samp_prob'] = df['weight']

    samps = pd.DataFrame(columns=['weight'])

    while True:
        choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
        samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
        df.drop(choice, axis=0, inplace=True)
        df['samp_prob'] = df['weight']/df['weight'].sum()
        if samps['weight'].sum() >= 0.6:
            break
    return samps

The problem with the toy example is that the running time appears to grow exponentially as n increases.

Answer 1

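Starting approach

A few observations:

Dropping a row on every iteration creates a new DataFrame each time, which does not help performance.
The loop does not look easy to vectorize, but it should be easy to work with the underlying array data for better performance. The idea is to use a mask and avoid recreating the DataFrame or array. To start, we use a two-column array whose columns correspond to the 'weight' and 'samp_prob' columns.

So, with those in mind, the starting approach would look like this: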
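As a minimal sketch of the mask idea (the four weights and the drawn position are made up for illustration), a drawn row is flagged rather than deleted, and the remaining rows are renormalized through the mask:

```python
import numpy as np

weights = np.array([0.1, 0.2, 0.3, 0.4])    # illustrative, already normalized
idx = np.arange(len(weights))
mask = np.ones(len(weights), dtype=bool)     # True = row is still available

# Pretend row 3 was drawn: flag it instead of deleting it.
mask[3] = False

remaining_idx = idx[mask]                    # original positions still in play
remaining_w = weights[mask]
probs = remaining_w / remaining_w.sum()      # renormalized sampling probabilities

sampled_w = weights[~mask]                   # weights collected so far
```

Because the arrays never change length, `idx[mask]` always maps a draw back to its original position.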

def sampler2(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)

    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a[mask,1] = a_masked/a_masked.sum()
        if a[~mask,0].sum() >= 0.6:
            break
    out = a[~mask,0]
    return out

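Improvement #1

A later observation was that the first column of the array does not change across iterations. So we can optimize the masked sum over the first column by precomputing the total once; on each iteration, a[~mask,0].sum() is then simply that total minus a_masked.sum(). This leads to the first improvement, listed below: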

def sampler3(n):
    a = np.random.rand(n,2)
    a[:,0] /= a[:,0].sum()
    a[:,1] = a[:,0]
    N = len(a)

    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a0_sum = a[:,0].sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
        mask[choice] = 0
        a_masked = a[mask,0]
        a_masked_sum = a_masked.sum()
        a[mask,1] = a_masked/a_masked_sum
        if a0_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask,0]
    return out
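The identity that sampler3 relies on can be checked in isolation. This is only a sanity-check sketch: the array size, the seed and the number of masked rows are arbitrary choices, not values from the answer.

```python
import numpy as np

rng = np.random.RandomState(0)           # fixed seed, only for reproducibility
a0 = rng.rand(1000)
a0 /= a0.sum()                           # normalized weights, like a[:,0]

mask = np.ones(a0.size, dtype=bool)
mask[rng.choice(a0.size, 300, replace=False)] = False   # mark 300 rows as drawn

total = a0.sum()                         # computed once, outside the loop
direct = a0[~mask].sum()                 # what sampler2 recomputes every iteration
shortcut = total - a0[mask].sum()        # what sampler3 uses instead

print(np.isclose(direct, shortcut))
```

The two values agree up to floating-point rounding, and the sum over the remaining rows is needed for the renormalization anyway, so the shortcut saves one masked sum per iteration.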

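Improvement #2

Next, the slicing and masking of columns of a 2D array can be avoided by using two separate 1D arrays, given that the first column does not change between iterations. That gives us a modified version, like so: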

def sampler4(n):
    a = np.random.rand(n)
    a /= a.sum()
    b = a.copy()
    N = len(a)

    idx = np.arange(N)
    mask = np.ones(N,dtype=bool)
    a_sum = a.sum()
    while True:
        choice = np.random.choice(idx[mask], 1, replace=False, p=b[mask])[0]
        mask[choice] = 0
        a_masked = a[mask]
        a_masked_sum = a_masked.sum()
        b[mask] = a_masked/a_masked_sum
        if a_sum - a_masked_sum >= 0.6:
            break
    out = a[~mask]
    return out

Runtime test:

In [250]: n = 1000

In [251]: %timeit sampler(n) # original app
     ...: %timeit sampler2(n)
     ...: %timeit sampler3(n)
     ...: %timeit sampler4(n)
1 loop, best of 3: 655 ms per loop
10 loops, best of 3: 50 ms per loop
10 loops, best of 3: 44.9 ms per loop
10 loops, best of 3: 38.4 ms per loop

In [252]: n = 2000

In [253]: %timeit sampler(n) # original app
     ...: %timeit sampler2(n)
     ...: %timeit sampler3(n)
     ...: %timeit sampler4(n)
1 loop, best of 3: 1.32 s per loop
10 loops, best of 3: 134 ms per loop
10 loops, best of 3: 119 ms per loop
10 loops, best of 3: 100 ms per loop

So the final version gives speedups of 17x+ and 13x+ over the original approach for n=1000 and n=2000 respectively!

Answer 2

I think you can rewrite this while loop so that it runs in a single pass:

while True:
    choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
    samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
    df.drop(choice, axis=0, inplace=True)
    df['samp_prob'] = df['weight']/df['weight'].sum()
    if samps['weight'].sum() >= 0.6:
        break

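as something more like:

n = len(df.index)
ind = np.random.choice(n, n, replace=False, p=df["samp_prob"])
res = df.iloc[ind]
# position (not label) of the first row where the running weight sum reaches 0.6
i = (res["weight"].cumsum().values >= 0.6).argmax()
samps = res.iloc[:i+1]

The key part is that choice can draw multiple elements (in fact the whole array) while still respecting the probabilities. The cumsum lets you cut off once the 0.6 threshold has been reached. Note that the cumulative sum is taken over the 'weight' column only, and that the cutoff must be a position rather than an index label, because res has been reordered and iloc slices by position.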

In the following example you can see that the array is ordered randomly, but 4 is most likely to be chosen near the top.

In [11]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[11]: array([0, 4, 3, 2, 1])

In [12]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[12]: array([3, 4, 1, 2, 0])

In [13]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[13]: array([0, 4, 3, 1, 2])

In [14]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[14]: array([4, 3, 0, 2, 1])

In [15]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[15]: array([4, 2, 3, 0, 1])

In [16]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[16]: array([3, 4, 2, 0, 1])
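The tendency visible in these outputs can be estimated empirically. The sketch below repeats the same call many times (the seed and the trial count are arbitrary) and measures how often 4 comes first and, given that, how often 3 comes second. If the remaining probabilities are renormalized after each pick, the second figure should be close to 0.2 / 0.4 = 0.5.

```python
import numpy as np

rng = np.random.RandomState(0)           # fixed seed, only for reproducibility
p = [0.05, 0.05, 0.1, 0.2, 0.6]
trials = 5000

# Each row is one full ordering of the 5 elements.
draws = np.array([rng.choice(5, 5, replace=False, p=p) for _ in range(trials)])

first_is_4 = draws[:, 0] == 4
frac_first_is_4 = first_is_4.mean()                    # expected to be near 0.6
# Among runs where 4 came first, how often does 3 come second?
frac_3_after_4 = (draws[first_is_4, 1] == 3).mean()    # expected to be near 0.5

print(frac_first_is_4, frac_3_after_4)
```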

Note: replace=False ensures the probabilities are "reweighed", in the sense that an element that has been picked cannot be picked again.
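Putting the pieces of this answer together, a self-contained NumPy-only version of the toy sampler might look like the sketch below. The function name sampler5 and the threshold parameter are illustrative and not part of the answer, and the sketch relies on the answer's premise that a single choice call with replace=False respects the reweighed probabilities.

```python
import numpy as np

def sampler5(n, threshold=0.6):
    # sampler5 and threshold are hypothetical names used for illustration
    w = np.random.rand(n)
    w /= w.sum()                                    # normalized weights

    # One call draws a full weighted ordering without replacement.
    order = np.random.choice(n, n, replace=False, p=w)
    csum = np.cumsum(w[order])                      # running total in draw order

    # Position of the first draw at which the running total reaches the threshold.
    i = np.argmax(csum >= threshold)
    return w[order[:i + 1]]                         # sampled weights, in draw order
```

Unlike the loop versions, this returns the weights in the order they were drawn, so dropping the last element always brings the total back below the threshold.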

How can I sample without replacement and reweigh each time (conditional sampling)?
