考虑具有权重的N行的数据集。这是基本算法:
这是一个玩具示例:
import pandas as pd
import numpy as np
def sampler(n):
df = pd.DataFrame(np.random.rand(n), columns=['weight'])
df['weight'] = df['weight']/df['weight'].sum()
df['samp_prob'] = df['weight']
samps = pd.DataFrame(columns=['weight'])
while True:
choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
df.drop(choice, axis=0, inplace=True)
df['samp_prob'] = df['weight']/df['weight'].sum()
if samps['weight'].sum() >= 0.6:
break
return samps
玩具示例的问题是随着n
大小的增加,运行时间呈指数增长:
答案 0 :(得分:2)
很少有观察结果:
'weights'
和'samp_prob'
的列。所以,考虑到这些,开始的方法将是这样的 -
def sampler2(n):
a = np.random.rand(n,2)
a[:,0] /= a[:,0].sum()
a[:,1] = a[:,0]
N = len(a)
idx = np.arange(N)
mask = np.ones(N,dtype=bool)
while True:
choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
mask[choice] = 0
a_masked = a[mask,0]
a[mask,1] = a_masked/a_masked.sum()
if a[~mask,0].sum() >= 0.6:
break
out = a[~mask,0]
return out
后来的观察发现,数组的第一列在迭代中没有变化。因此,我们可以通过预先计算总和来优化第一列的蒙板总和,然后在每次迭代时,a[~mask,0].sum()
将只是总和减去a_masked.sum()
。 Thsi引导我们进行第一次改进,列在下面 -
def sampler3(n):
a = np.random.rand(n,2)
a[:,0] /= a[:,0].sum()
a[:,1] = a[:,0]
N = len(a)
idx = np.arange(N)
mask = np.ones(N,dtype=bool)
a0_sum = a[:,0].sum()
while True:
choice = np.random.choice(idx[mask], 1, replace=False, p=a[mask,1])[0]
mask[choice] = 0
a_masked = a[mask,0]
a_masked_sum = a_masked.sum()
a[mask,1] = a_masked/a_masked_sum
if a0_sum - a_masked_sum >= 0.6:
break
out = a[~mask,0]
return out
现在,可以通过使用两个单独的数组来改进对2D数组的列的切片和屏蔽,前提是第一列在迭代之间没有变化。这给了我们一个修改过的版本,就像这样 -
def sampler4(n):
a = np.random.rand(n)
a /= a.sum()
b = a.copy()
N = len(a)
idx = np.arange(N)
mask = np.ones(N,dtype=bool)
a_sum = a.sum()
while True:
choice = np.random.choice(idx[mask], 1, replace=False, p=b[mask])[0]
mask[choice] = 0
a_masked = a[mask]
a_masked_sum = a_masked.sum()
b[mask] = a_masked/a_masked_sum
if a_sum - a_masked_sum >= 0.6:
break
out = a[~mask]
return out
运行时测试 -
In [250]: n = 1000
In [251]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 655 ms per loop
10 loops, best of 3: 50 ms per loop
10 loops, best of 3: 44.9 ms per loop
10 loops, best of 3: 38.4 ms per loop
In [252]: n = 2000
In [253]: %timeit sampler(n) # original app
...: %timeit sampler2(n)
...: %timeit sampler3(n)
...: %timeit sampler4(n)
1 loop, best of 3: 1.32 s per loop
10 loops, best of 3: 134 ms per loop
10 loops, best of 3: 119 ms per loop
10 loops, best of 3: 100 ms per loop
因此,我们获得 17x+
和 13x+
加速,其最终版本优于n=1000
和{的原始方法{1}}尺寸!
答案 1 :(得分:0)
我认为你可以在一次循环中重写这个循环:
while True:
choice = np.random.choice(df.index, 1, replace=False, p=df['samp_prob'])[0]
samps.loc[choice, 'weight'] = df.loc[choice, 'weight']
df.drop(choice, axis=0, inplace=True)
df['samp_prob'] = df['weight']/df['weight'].sum()
if samps['weight'].sum() >= 0.6:
break
更像是:
n = len(df.index)
ind = np.random.choice(n, n, replace=False, p=df["samp_prob"])
res = df.iloc[ind]
i = (res.cumsum() >= 0.6).idxmax() # first index that satisfies .sum() >= 0.6
samps = res.iloc[:i+1]
关键部分是选择可以采用多个元素(实际上是整个数组),同时仍然尊重概率。 cumsum允许您在超过0.6阈值后切断。
在这个例子中,你可以看到数组是随机选择的,但是4最有可能选择在顶部附近。
In [11]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[11]: array([0, 4, 3, 2, 1])
In [12]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[12]: array([3, 4, 1, 2, 0])
In [13]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[13]: array([0, 4, 3, 1, 2])
In [14]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[14]: array([4, 3, 0, 2, 1])
In [15]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[15]: array([4, 2, 3, 0, 1])
In [16]: np.random.choice(5, 5, replace=False, p=[0.05, 0.05, 0.1, 0.2, 0.6])
Out[16]: array([3, 4, 2, 0, 1])
注意:replace = False,确保概率重新称重"从某种意义上说,它无法再次被选中。