Identifying the rows that contain the column medians in a numpy matrix of cumulative percentiles

Time: 2016-09-07 02:52:16

Tags: python arrays numpy boolean median

Consider the matrix quantiles, which is the subset [:8,:3,0] of a 3D matrix of shape (10,355,8):

quantiles = np.array([
    [ 1.        ,  1.        ,  1.        ],
    [ 0.63763978,  0.61848863,  0.75348137],
    [ 0.43439645,  0.42485407,  0.5341457 ],
    [ 0.22682343,  0.18878366,  0.25253915],
    [ 0.16229408,  0.12541476,  0.15263742],
    [ 0.12306046,  0.10372971,  0.09832783],
    [ 0.09271845,  0.08209844,  0.05982584],
    [ 0.06363636,  0.05471266,  0.03855727]])

I would like a boolean output with the same shape as the quantiles matrix, in which True marks the row where the median is located:

In [21]: medians
Out[21]:
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

To achieve this, I have the following algorithm in mind:

1) Identify the entries that are greater than .5:

In [22]: quantiles>.5
Out[22]:
array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

2) Considering only the subset of values selected by the quantiles>.5 operation, mark the row that minimizes the np.abs distance between the entry and .5. Torturing the terminology a bit, I want to intersect the two matrices quantiles>.5 and np.argmin(np.abs(quantiles-.5),axis=0) to get the result above. However, I cannot for the life of me find a way to perform np.argmin on the subset and retain the shape of the quantiles matrix.

PS. Yes, there is a similar question here, but it does not implement my algorithm, which I think could be more efficient at larger scale.

2 Answers:

Answer 0 (score: 1)

Digging into NumPy's old masked-array (ma) operations, I found the following solution:

import numpy as np
import numpy.ma as ma

# mask quantiles that are less than or equal to .5
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)

# identify the minimum in each column of the masked array
median_idx = np.where(masked_quantiles == masked_quantiles.min(axis=0))

# make a matrix of all False values
median_mat = np.zeros(quantiles.shape, dtype=bool)

# assign True to the corresponding rows
In [86]: median_mat[median_idx] = True

In [87]: median_mat
Out[87]:
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
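
For reference, the masked-array idea above can be condensed into a self-contained sketch. It uses the `<=.5` mask and the `argmin` call from the timing code further down rather than `np.where`, and the name `median_row` is introduced here purely for illustration; treat it as one possible way to write it, not as the canonical form.

```python
import numpy as np
import numpy.ma as ma

quantiles = np.array([
    [1.0,        1.0,        1.0],
    [0.63763978, 0.61848863, 0.75348137],
    [0.43439645, 0.42485407, 0.5341457],
    [0.22682343, 0.18878366, 0.25253915],
    [0.16229408, 0.12541476, 0.15263742],
    [0.12306046, 0.10372971, 0.09832783],
    [0.09271845, 0.08209844, 0.05982584],
    [0.06363636, 0.05471266, 0.03855727]])

# Hide every entry at or below 0.5 so it cannot win the argmin.
masked_quantiles = ma.masked_where(quantiles <= 0.5, quantiles)

# Row index of the smallest unmasked entry in each column
# (illustrative name, not from the original answer).
median_row = masked_quantiles.argmin(axis=0)

# Scatter the winning (row, column) pairs into an all-False matrix.
median_mat = np.zeros(quantiles.shape, dtype=bool)
median_mat[median_row, np.arange(quantiles.shape[1])] = True

print(median_row)
print(median_mat)
```

Indexing with a row array and a matching column array sets exactly one True per column, which avoids comparing floating-point values for equality.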

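Update: comparing my answer with Divakar's:

I ran two comparisons: one on the sample 2D matrix given in this question, and one on my 3D (10,380,8) dataset (not big data by any means).

Sample dataset:

My code: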

%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)

10000 loops, best of 3: 65.1 µs per loop

Divakar's code:

%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)

The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.92 µs per loop

Full dataset:

My code:

%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)

1000 loops, best of 3: 490 µs per loop

Divakar's code:

%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)

10000 loops, best of 3: 172 µs per loop

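Conclusion:

Divakar's answer appears to be roughly 3-12 times faster than mine (about 11x on the sample matrix and about 3x on the full dataset). I assume the ma.masked_where masking operation takes longer than the matrix addition. However, the result of the addition has to be stored as a temporary array, so masking might turn out to be more efficient on larger datasets. I wonder how the two would compare on something that does not fit, or only barely fits, in memory.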
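
One way to repeat this comparison outside IPython is sketched below. It assumes synthetic stand-in data (random values sorted in descending order along the first axis, with the first row forced to 1.0) in place of the real (10,380,8) dataset, and the helper names `masked_argmin` and `additive_argmin` are made up for the example. Absolute timings depend on the machine and the NumPy version, so no numbers are shown.

```python
import timeit

import numpy as np
import numpy.ma as ma

# Synthetic stand-in for the real dataset (an assumption, not the author's data).
rng = np.random.default_rng(0)
data = np.sort(rng.random((10, 380, 8)), axis=0)[::-1].copy()
data[0] = 1.0  # guarantee at least one entry above 0.5 in every column


def masked_argmin(q):
    """Masked-array variant: hide entries <= 0.5, then argmin along axis 0."""
    return ma.masked_where(q <= 0.5, q).argmin(0)


def additive_argmin(q):
    """Additive variant: push entries <= 0.5 above 1, then argmin along axis 0."""
    return (q + (q <= 0.5)).argmin(0)


# Both variants should select the same rows before their speed is compared.
same_result = np.array_equal(masked_argmin(data), additive_argmin(data))

t_masked = min(timeit.repeat(lambda: masked_argmin(data), number=50, repeat=3))
t_additive = min(timeit.repeat(lambda: additive_argmin(data), number=50, repeat=3))

print("same result:", same_result)
print("masked   (s per 50 runs):", t_masked)
print("additive (s per 50 runs):", t_additive)
```

Checking that both functions agree first guards against timing two computations that silently differ.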

Answer 1 (score: 1)

Approach #1

Here is an approach using broadcasting and a masking trick:

# Mask of quantiles lesser than or equal to 0.5 to select the invalid ones
mask1 = quantiles<=0.5

# Since we are dealing with quantiles, the elems won't be > 1, 
# which can be leveraged here as we will add 1s to invalid elems, and 
# then look for argmin across each col
min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)

# Let some broadcasting magic happen here!
out = min_idx == np.arange(quantiles.shape[0])[:,None]

Step-by-step run

1) Input:

In [37]: quantiles
Out[37]: 
array([[ 1.        ,  1.        ,  1.        ],
       [ 0.63763978,  0.61848863,  0.75348137],
       [ 0.43439645,  0.42485407,  0.5341457 ],
       [ 0.22682343,  0.18878366,  0.25253915],
       [ 0.16229408,  0.12541476,  0.15263742],
       [ 0.12306046,  0.10372971,  0.09832783],
       [ 0.09271845,  0.08209844,  0.05982584],
       [ 0.06363636,  0.05471266,  0.03855727]])

2) Run the code:

In [38]: mask1 = quantiles<=0.5
    ...: min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
    ...: out = min_idx == np.arange(quantiles.shape[0])[:,None]
    ...: 

3) Inspect the output of each step:

In [39]: mask1
Out[39]: 
array([[False, False, False],
       [False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

In [40]: np.abs(quantiles-0.5)+mask1
Out[40]: 
array([[ 0.5       ,  0.5       ,  0.5       ],
       [ 0.13763978,  0.11848863,  0.25348137],
       [ 1.06560355,  1.07514593,  0.0341457 ],
       [ 1.27317657,  1.31121634,  1.24746085],
       [ 1.33770592,  1.37458524,  1.34736258],
       [ 1.37693954,  1.39627029,  1.40167217],
       [ 1.40728155,  1.41790156,  1.44017416],
       [ 1.43636364,  1.44528734,  1.46144273]])

In [41]: (np.abs(quantiles-0.5)+mask1).argmin(0)
Out[41]: array([1, 1, 2])

In [42]: min_idx == np.arange(quantiles.shape[0])[:,None]
Out[42]: 
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

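Performance boost: following up on the comments, it seems that to get min_idx we can skip the np.abs step. Every valid entry lies in (0.5, 1], while adding the mask pushes every positive invalid entry above 1, so argmin already lands on the valid entry closest to 0.5: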

min_idx = (quantiles+mask1).argmin(0)
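
Since the sample is a slice of a 3D array, it may help to see the same trick applied along an arbitrary axis. The following is a minimal sketch, with `median_row_mask` a hypothetical helper name, that assumes all values lie in (0, 1] and that every slice along `axis` has at least one entry above 0.5.

```python
import numpy as np


def median_row_mask(q, axis=0):
    """Boolean array shaped like q, True at the smallest entry > 0.5 along `axis`.

    Hypothetical helper; assumes values in (0, 1] and at least one
    entry above 0.5 in every slice along `axis`.
    """
    q = np.asarray(q, dtype=float)
    # Entries <= 0.5 get +1, which moves them above every valid entry.
    min_idx = (q + (q <= 0.5)).argmin(axis=axis)
    # Compare each position along `axis` with the winning index via broadcasting.
    shape = [1] * q.ndim
    shape[axis] = q.shape[axis]
    positions = np.arange(q.shape[axis]).reshape(shape)
    return positions == np.expand_dims(min_idx, axis)


quantiles = np.array([
    [1.0,        1.0,        1.0],
    [0.63763978, 0.61848863, 0.75348137],
    [0.43439645, 0.42485407, 0.5341457],
    [0.22682343, 0.18878366, 0.25253915],
    [0.16229408, 0.12541476, 0.15263742],
    [0.12306046, 0.10372971, 0.09832783],
    [0.09271845, 0.08209844, 0.05982584],
    [0.06363636, 0.05471266, 0.03855727]])

mask2d = median_row_mask(quantiles)             # shape (8, 3)
stacked = np.stack([quantiles, quantiles], -1)  # toy 3D input, shape (8, 3, 2)
mask3d = median_row_mask(stacked, axis=0)       # shape (8, 3, 2)
print(mask2d)
```

Building `positions` with a reshaped `np.arange` keeps the comparison a pure broadcast, so no index array of the full output size is materialised.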

Approach #2

This one focuses mainly on memory efficiency.

# Mask of quantiles greater than 0.5 to select the valid ones
mask = quantiles>0.5

# Select valid elems
vals = quantiles.T[mask.T]

# Get valid count per col
count = mask.sum(0)

# Get the min val per col given the mask
minval = np.minimum.reduceat(vals,np.append(0,count[:-1].cumsum()))

# Get final boolean array by just comparing the min vals across each col
out = np.isclose(quantiles,minval)
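
To make the `reduceat` step easier to follow, here is the same code run on the sample matrix with the intermediate values exposed. The variable `starts` is a name added for this illustration. The sketch assumes every column has at least one entry above 0.5; a column with none would give `reduceat` a repeated start offset and the result would no longer line up with the columns.

```python
import numpy as np

quantiles = np.array([
    [1.0,        1.0,        1.0],
    [0.63763978, 0.61848863, 0.75348137],
    [0.43439645, 0.42485407, 0.5341457],
    [0.22682343, 0.18878366, 0.25253915],
    [0.16229408, 0.12541476, 0.15263742],
    [0.12306046, 0.10372971, 0.09832783],
    [0.09271845, 0.08209844, 0.05982584],
    [0.06363636, 0.05471266, 0.03855727]])

mask = quantiles > 0.5

# Transposing first makes boolean indexing walk column by column, so the
# valid values of each column end up contiguous in the 1D result.
vals = quantiles.T[mask.T]

# Number of valid entries per column.
count = mask.sum(0)

# Offset in `vals` where each column's run of valid entries begins.
starts = np.append(0, count[:-1].cumsum())

# Minimum over each run [starts[i], starts[i+1]).
minval = np.minimum.reduceat(vals, starts)

# Flag entries equal (within tolerance) to their column's minimum.
out = np.isclose(quantiles, minval)

print(count, starts, minval)
print(out)
```

Because the last step compares values rather than indices, a column that contains its minimum valid value more than once would get more than one True.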