I have an S x n array DATA containing data. I have an (S x 1) array ARRAY whose integer values are <= n. For each row i of DATA, I want to do:
DATA[i, ARRAY[i]:] = np.nan
This is what I am doing now:
from numpy.random import poisson as poissonN
from numpy.random import uniform
import numpy as np
S = 1000
n = 8
DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
ARRAY = poissonN(1, S).reshape((-1, 1))
# ARRAY[:, 0] yields scalar cut-offs; recent NumPy rejects a size-1 array as a slice bound
for i, draw in enumerate(ARRAY[:, 0]):
    DATA[i, draw:] = np.nan
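There must be a vectorized analogue that would be much more efficient when S is in the hundreds of thousands, right? No matter what kind of meshgrid approach I try, it either does not work or is just as slow as this iterative approach.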
Answer 0 (score: 2)
You can use NumPy broadcasting and boolean indexing:
DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan
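The one-liner relies on ARRAY having shape (S, 1) and modifies DATA in place. As a minimal sketch of two nearby variants, assuming a hypothetical 1-D array named cutoffs (not in the original question) and small hand-picked values, the comparison can be given a trailing axis, and np.where can be used when a new array is preferred over in-place assignment:

```python
import numpy as np

DATA = np.arange(12, dtype=float).reshape(3, 4)
cutoffs = np.array([1, 0, 3])  # hypothetical 1-D version of ARRAY, shape (S,)

# Add a trailing axis so (S, 1) broadcasts against (n,) to give an (S, n) mask
mask = cutoffs[:, None] <= np.arange(DATA.shape[1])

# np.where builds a new array and leaves DATA itself untouched
masked = np.where(mask, np.nan, DATA)
print(masked)
```

Whether to assign in place or build a new array depends on whether the unmasked DATA is still needed afterwards.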
Explanation
Let's take S = 5 and n = 4 as an example and create DATA and ARRAY.
In [288]: S = 5
...: n = 4
...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
...: ARRAY = poissonN(1, S).reshape((-1, 1))
...:
In [289]: DATA
Out[289]:
array([[ 0.54235747, 0.01309313, 0.62664698, 0.92081697],
[ 0.17877576, 0.36536259, 0.91874957, 0.81924979],
[ 0.7518459 , 0.73218436, 0.99685998, 0.26435871],
[ 0.73130257, 0.77123956, 0.10437601, 0.09296549],
[ 0.804398 , 0.78675381, 0.71066382, 0.87481544]])
In [290]: ARRAY
Out[290]:
array([[1],
[1],
[0],
[2],
[1]])
Now, run the loop code and see what happens:
In [291]: for i, draw in enumerate(ARRAY):
...:     DATA[i, draw:] = np.nan
...:
In [292]: DATA
Out[292]:
array([[ 0.54235747, nan, nan, nan],
[ 0.17877576, nan, nan, nan],
[ nan, nan, nan, nan],
[ 0.73130257, 0.77123956, nan, nan],
[ 0.804398 , nan, nan, nan]])
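Now, with the proposed solution, we create a boolean array with the same shape as DATA that is True at every element to be set to NaN and False everywhere else. To build it, we use broadcasting as shown here: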
In [293]: ARRAY <= np.arange(DATA.shape[1])
Out[293]:
array([[False, True, True, True],
[False, True, True, True],
[ True, True, True, True],
[False, False, True, True],
[False, True, True, True]], dtype=bool)
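Thus, using boolean indexing, we can set all of those positions in DATA to NaN. Let's create another instance of random elements and test the NaN assignment with the proposed approach: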
In [294]: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
In [295]: DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan
In [296]: DATA
Out[296]:
array([[ 0.87061908, nan, nan, nan],
[ 0.69237094, nan, nan, nan],
[ nan, nan, nan, nan],
[ 0.04257803, 0.82311917, nan, nan],
[ 0.00723291, nan, nan, nan]])
Note that the non-NaN values differ because we re-created DATA. The important point is that the NaNs are set in the correct positions.
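The comparison above can also be checked programmatically. The following is a minimal sketch, assuming a seeded generator from np.random.default_rng (not used in the original post) so that the loop and the mask operate on identical data. It iterates over ARRAY[:, 0] so that each slice bound is a scalar.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator: an assumption, for reproducibility
S, n = 5, 4
DATA = rng.uniform(size=(S, n))
ARRAY = rng.poisson(1, size=(S, 1))  # one cut-off per row, shape (S, 1)

# (S, 1) compared against (n,) broadcasts to an (S, n) boolean mask
mask = ARRAY <= np.arange(n)

# Reference result from the explicit loop, using scalar cut-offs
looped = DATA.copy()
for i, draw in enumerate(ARRAY[:, 0]):
    looped[i, draw:] = np.nan

# Same operation via boolean indexing
vectorized = DATA.copy()
vectorized[mask] = np.nan

same = np.allclose(looped, vectorized, equal_nan=True)
print(mask.shape, same)
```

Because both results start from copies of the same DATA, the check covers the surviving values as well as the NaN positions.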
Runtime test
In [297]: # Inputs
...: S = 1000
...: n = 8
...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
...: ARRAY = poissonN(1, S).reshape((-1, 1))
...:
In [298]: DATAc = DATA.copy() # Make copy for testing proposed ans
In [299]: def org_app(DATA,ARRAY):
...:     for i, draw in enumerate(ARRAY):
...:         DATA[i, draw:] = np.nan
...:
In [301]: %timeit org_app(DATA,ARRAY)
100 loops, best of 3: 4.99 ms per loop
In [302]: %timeit DATAc[ARRAY <= np.arange(DATAc.shape[1])] = np.nan
10000 loops, best of 3: 94.1 µs per loop
In [305]: np.allclose(np.isnan(DATA),np.isnan(DATAc))
Out[305]: True
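The question asks about S in the hundreds of thousands, so the same comparison can be repeated at that scale outside IPython. This is only a sketch: the helper names loop_fill and mask_fill, the choice S = 200_000, and the use of the standard-library timeit module are assumptions, and the timings will depend on the machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed for reproducibility
S, n = 200_000, 8
DATA = rng.uniform(size=(S, n))
ARRAY = rng.poisson(1, size=(S, 1))

def loop_fill(data, cutoffs):
    """Row-by-row version from the question, with scalar cut-offs."""
    for i, draw in enumerate(cutoffs[:, 0]):
        data[i, draw:] = np.nan
    return data

def mask_fill(data, cutoffs):
    """Broadcasted boolean-mask version from the answer."""
    data[cutoffs <= np.arange(data.shape[1])] = np.nan
    return data

# Check agreement on fresh copies before timing
out_loop = loop_fill(DATA.copy(), ARRAY)
out_mask = mask_fill(DATA.copy(), ARRAY)
agree = np.array_equal(np.isnan(out_loop), np.isnan(out_mask))

# Each timed call works on a fresh copy, so the copy cost is included in both
t_loop = timeit.timeit(lambda: loop_fill(DATA.copy(), ARRAY), number=3)
t_mask = timeit.timeit(lambda: mask_fill(DATA.copy(), ARRAY), number=3)
print(f"agree={agree}, loop={t_loop:.3f}s, mask={t_mask:.3f}s")
```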