基于数组的矢量化矩阵选择

时间:2016-06-14 20:01:02

标签: python performance numpy scipy vectorization

我有一个包含数据的S x n数组DATA。我有一个(S x 1)数组ARRAY,其整数值为<=n。对于i中的每一行DATA,我想

DATA[i, ARRAY[i]:] = np.nan

这就是我现在正在做的事情

from numpy.random import poisson as poissonN
from numpy.random import uniform
import numpy as np
S = 1000
n = 8
DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
ARRAY = poissonN(1, S).reshape((-1, 1))

for i, draw in enumerate(ARRAY):
    DATA[i, draw:] = np.nan

必须有一个向量化的模拟,如果S数十万就更有效率,对吧?无论我尝试什么样的网格划分,它都无法解决 - 或者对于这种迭代方法同样缓慢。

1 个答案:

答案 0 :(得分:2)

您可以使用NumPy broadcastingboolean indexing -

DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan

<强>解释

让我们以S = 5n=4为例,创建DATAARRAY

In [288]: S = 5
     ...: n = 4
     ...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
     ...: ARRAY = poissonN(1, S).reshape((-1, 1))
     ...: 

In [289]: DATA
Out[289]: 
array([[ 0.54235747,  0.01309313,  0.62664698,  0.92081697],
       [ 0.17877576,  0.36536259,  0.91874957,  0.81924979],
       [ 0.7518459 ,  0.73218436,  0.99685998,  0.26435871],
       [ 0.73130257,  0.77123956,  0.10437601,  0.09296549],
       [ 0.804398  ,  0.78675381,  0.71066382,  0.87481544]])

In [290]: ARRAY
Out[290]: 
array([[1],
       [1],
       [0],
       [2],
       [1]])

现在,运行循环代码,看看会发生什么 -

In [291]: for i, draw in enumerate(ARRAY):
     ...:     DATA[i, draw:] = np.nan
     ...:     

In [292]: DATA
Out[292]: 
array([[ 0.54235747,         nan,         nan,         nan],
       [ 0.17877576,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [ 0.73130257,  0.77123956,         nan,         nan],
       [ 0.804398  ,         nan,         nan,         nan]])

现在,通过提出的解决方案,我们创建了一个与DATA形状相同的布尔数组,以便将所有NaN元素覆盖为True,并将其作为False

同样,我们正在使用此处所示的broadcasting -

In [293]: ARRAY <= np.arange(DATA.shape[1])
Out[293]: 
array([[False,  True,  True,  True],
       [False,  True,  True,  True],
       [ True,  True,  True,  True],
       [False, False,  True,  True],
       [False,  True,  True,  True]], dtype=bool)

因此,使用布尔索引,我们可以将所有这些位置设置为DATA中的NaN。让我们创建另一个随机元素实例,并用我们提出的方法测试NaN -

In [294]: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))

In [295]: DATA[ARRAY <= np.arange(DATA.shape[1])] = np.nan

In [296]: DATA
Out[296]: 
array([[ 0.87061908,         nan,         nan,         nan],
       [ 0.69237094,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [ 0.04257803,  0.82311917,         nan,         nan],
       [ 0.00723291,         nan,         nan,         nan]])

请注意非Nan值不同,因为我们重新创建了DATA。需要注意的重要一点是,我们已正确设置NaNs

运行时测试

In [297]: # Inputs
     ...: S = 1000
     ...: n = 8
     ...: DATA = uniform(low=0, high=1, size=S*n).reshape((S, n))
     ...: ARRAY = poissonN(1, S).reshape((-1, 1))
     ...: 

In [298]: DATAc = DATA.copy() # Make copy for testing proposed ans

In [299]: def org_app(DATA,ARRAY):
     ...:     for i, draw in enumerate(ARRAY):
     ...:         DATA[i, draw:] = np.nan
     ...:         

In [301]: %timeit org_app(DATA,ARRAY)
100 loops, best of 3: 4.99 ms per loop

In [302]: %timeit DATAc[ARRAY <= np.arange(DATAc.shape[1])] = np.nan
10000 loops, best of 3: 94.1 µs per loop

In [305]: np.allclose(np.isnan(DATA),np.isnan(DATAc))
Out[305]: True