Question

If you don't care about the details of what I'm trying to implement, just skip past the lower horizontal line.

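I am trying to produce a bootstrap error estimate on some statistic using NumPy. I have an array x and wish to compute the error on the statistic f(x), for which the usual Gaussian assumptions of error analysis do not hold. x is very large.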

To do this, I resample x using numpy.random.choice(), with replacement, where the size of each resample is the size of the original array:

resample = np.random.choice(x, size=len(x), replace=True)

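This gives me a new realization of x. The operation must now be repeated about 1,000 times to give an accurate error estimate. If I generate 1,000 resamples of this kind: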

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

and then compute the statistic f(x) on each realization:

results = [f(arr) for arr in resamples]

then I infer the error on f(x) to be something like

np.std(results)

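The idea is that even if f(x) itself cannot be described using Gaussian error analysis, the distribution of f(x) measurements subject to random error can be.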
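As a minimal sketch, the steps above can be collected into one function. The name `bootstrap_std`, the seed values, the toy exponential data, and the choice of `np.median` as f are all assumptions made for illustration; the sketch also uses `np.random.default_rng` rather than the legacy `np.random.choice` call shown above.

```python
import numpy as np

def bootstrap_std(x, f, n_resamples=1000, seed=None):
    """Estimate the error on f(x) as the std of f over bootstrap resamples.

    Hypothetical helper: not part of NumPy.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    results = np.empty(n_resamples)
    for i in range(n_resamples):
        # Draw n elements from x with replacement (one bootstrap realization).
        resample = rng.choice(x, size=n, replace=True)
        results[i] = f(resample)
    return np.std(results)

# Skewed toy data, used only to exercise the function.
x = np.random.default_rng(0).exponential(size=2000)
err = bootstrap_std(x, np.median, n_resamples=200, seed=1)
print(err)
```

This version keeps only one resample in memory at a time, at the cost of a Python-level loop.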

Okay, so that is a bootstrap. Now, my problem is that the line

resamples = [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]

is very slow for large arrays. Is there a smarter way to do this without a list comprehension? The second list comprehension

results = [f(arr) for arr in resamples]

could also be slow, depending on the details of the function f(x).

Answer 1

Since we allow repeats, we can generate all the indices in one go with np.random.randint and then simply index into x to get the equivalent of resamples, like so -

num_samples = 1000
idx = np.random.randint(0,len(x),size=(num_samples,len(x)))
resamples_arr = x[idx]

Another approach is to generate random numbers from a uniform distribution with numpy.random.rand and scale them to the length of the array, like so -

resamples_arr = x[(np.random.rand(num_samples,len(x))*len(x)).astype(int)]
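For reference, here is a small sketch of the same index-based idea written against NumPy's newer Generator API. Swapping `np.random.randint` for `rng.integers` is my own substitution, and the array sizes and seed are arbitrary values chosen to keep the example quick.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)   # small stand-in for the real data
num_samples = 8

# One row of indices per resample; integers() excludes the upper bound,
# so every index is valid for x.
idx = rng.integers(0, len(x), size=(num_samples, len(x)))
resamples_arr = x[idx]

print(resamples_arr.shape)               # (8, 50)
print(np.isin(resamples_arr, x).all())   # True
```

Each row of `resamples_arr` is one bootstrap resample, so a statistic can then be computed row by row.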

Runtime test with an x of 5000 elements -

In [221]: x = np.random.randint(0,10000,(5000))

# Original soln
In [222]: %timeit [np.random.choice(x, size=len(x), replace=True) for i in range(1000)]
10 loops, best of 3: 84 ms per loop

# Proposed soln-1
In [223]: %timeit x[np.random.randint(0,len(x),size=(1000,len(x)))]
10 loops, best of 3: 76.2 ms per loop

# Proposed soln-2
In [224]: %timeit x[(np.random.rand(1000,len(x))*len(x)).astype(int)]
10 loops, best of 3: 59.7 ms per loop

For very large x

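With a very large array x of 600,000 elements, you might not want to create all those indices for 1000 samples at once. In that case, the per-sample timings look like this -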

In [234]: x = np.random.randint(0,10000,(600000))

# Original soln
In [235]: %timeit np.random.choice(x, size=len(x), replace=True)
100 loops, best of 3: 13 ms per loop

# Proposed soln-1
In [238]: %timeit x[np.random.randint(0,len(x),len(x))]
100 loops, best of 3: 12.5 ms per loop

# Proposed soln-2
In [239]: %timeit x[(np.random.rand(len(x))*len(x)).astype(int)]
100 loops, best of 3: 9.81 ms per loop
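One possible middle ground between building all indices at once and looping one sample at a time is to work in batches. The sketch below is an illustration under assumptions: `bootstrap_in_batches` and `batch_size` are hypothetical names, it uses the Generator API instead of `np.random.randint`, and it requires that f can reduce a 2D array along axis=1.

```python
import numpy as np

def bootstrap_in_batches(x, f, num_samples=1000, batch_size=50, seed=None):
    """Apply f to bootstrap resamples of x, building batch_size rows at a time.

    Hypothetical helper. f must accept a 2D array and reduce along axis=1,
    for example: lambda a: a.mean(axis=1).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    results = np.empty(num_samples)
    for start in range(0, num_samples, batch_size):
        stop = min(start + batch_size, num_samples)
        # Index array of shape (rows_in_batch, n); peak memory scales
        # with batch_size * n rather than num_samples * n.
        idx = rng.integers(0, n, size=(stop - start, n))
        results[start:stop] = f(x[idx])
    return results

x = np.random.default_rng(0).normal(size=10_000)
results = bootstrap_in_batches(
    x, lambda a: a.mean(axis=1), num_samples=100, batch_size=20, seed=1
)
print(results.shape, results.std())
```

A suitable `batch_size` depends on available memory; no timing is claimed for this variant.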

Answer 2

As @Divakar mentioned, you can pass a tuple to size to get a 2D array of resamples rather than using a list comprehension.

Here, assume for a moment that f is just sum rather than some other function. Then:

import numpy as np

f = np.sum  # assumed statistic, as stated above
x = np.random.randn(100000)
resamples = np.random.choice(x, size=(1000, x.shape[0]), replace=True)
# resamples.shape = (1000, 100000)
results = np.apply_along_axis(f, axis=1, arr=resamples)
print(results.shape)
# (1000,)

Here np.apply_along_axis is just a glorified for-loop equivalent to [f(arr) for arr in resamples]. But I am not sure whether, based on your question, you need to index x here.
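If f happens to be a reduction that NumPy already supports along an axis (sum, mean, std, and so on), the loop can be avoided entirely. A minimal sketch, assuming f is sum and using deliberately small, arbitrary array sizes and the Generator API:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
resamples = rng.choice(x, size=(200, x.shape[0]), replace=True)

# Python-level loop over rows, hidden inside apply_along_axis.
looped = np.apply_along_axis(np.sum, axis=1, arr=resamples)
# Same statistic computed by a single vectorized reduction.
vectorized = resamples.sum(axis=1)

print(np.allclose(looped, vectorized))
```

For a general f with no axis argument, the loop (explicit or via np.apply_along_axis) is still needed.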

Efficient way to sample a large array many times with NumPy?
