Question

I have an n x n NumPy array containing all pairwise distances, and another 1 x n array containing a scoring metric.

Example:

import numpy as np
import scipy.spatial.distance

dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))

array([[  0. ,   3.2,   4.1,   8.8,   0.6],
       [  3.2,   0. ,   1.5,   9. ,   5. ],
       [  4.1,   1.5,   0. ,   9.9,  10. ],
       [  8.8,   9. ,   9.9,   0. ,   1.1],
       [  0.6,   5. ,  10. ,   1.1,   0. ]])

score = np.array([19., 1.3, 4.8, 6.2, 5.7])

array([ 19. ,   1.3,   4.8,   6.2,   5.7])

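Note that element i of the score array corresponds to row i of the distance array.

What I need to do is vectorize this process:

1. For the i-th value in the score array, find all other values greater than it and note their indices.
2. Then, in row i of the distance array, take the distances at the indices found in step 1 and return the smallest one.
3. If the i-th value in the score array is the largest, set the minimum distance to the largest distance found in the distance array.

Here is a non-vectorized version: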

n = score.shape[0]
min_dist = np.full(n, np.max(dists))
for i in range(score.shape[0]):
    inx = np.where(score > score[i])
    if len(inx[0]) > 0:
        min_dist[i] = np.min(dists[i, inx])

min_dist

array([ 10. ,   1.5,   4.1,   8.8,   0.6])

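This works, but it is quite slow, and my arrays are expected to be much, much larger. I am hoping to get the same result more efficiently by using faster vectorized operations.

Update: based on Oliver W.'s answer, I came up with my own version, which does not need to make a copy of the distance array: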

def new_method(dists, score):
    mask = score > score.reshape(-1,1)
    return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max())

In theory it could be written as a one-liner, but it is already a bit challenging to read for the untrained eye.
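For readers who find the chained expression dense, the following is a sketch that unpacks it into named steps on the sample data from the question. The intermediate names (`masked`, `row_min`, `min_dist`) are introduced here for illustration only, and the distance matrix is written out literally so the snippet needs nothing but NumPy.

```python
import numpy as np

# Sample data from the question (the squareform result written out literally).
dists = np.array([[0.0, 3.2, 4.1, 8.8, 0.6],
                  [3.2, 0.0, 1.5, 9.0, 5.0],
                  [4.1, 1.5, 0.0, 9.9, 10.0],
                  [8.8, 9.0, 9.9, 0.0, 1.1],
                  [0.6, 5.0, 10.0, 1.1, 0.0]])
score = np.array([19., 1.3, 4.8, 6.2, 5.7])

# mask[i, j] is True where score[j] > score[i].
mask = score > score.reshape(-1, 1)

# A masked array hides entries whose mask is True, so the mask is inverted:
# hide every distance whose score is NOT greater than score[i].
masked = np.ma.masked_array(dists, mask=~mask)

# Row-wise minimum over the visible entries only.
# A row with nothing visible (the top score) yields a masked result.
row_min = masked.min(axis=1)

# Replace the masked results with the largest distance in the matrix.
min_dist = row_min.filled(dists.max())
print(min_dist)  # [10.   1.5  4.1  8.8  0.6]
```

The result matches the output of the loop version above. No copy of `dists` is modified, although the boolean mask itself still takes n x n entries.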

Answer 1

One possible vectorized solution is given below.

import numpy as np
import scipy.spatial.distance

dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])

def your_method(dists, score):
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])
        if len(inx[0]) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist

def vectorized_method_v1(dists, score):
    mask = score > score.reshape(-1,1)
    dists2 = dists.copy()  # get rid of this in case the dists array can be changed
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)
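The key step in the vectorized version is the broadcast comparison that builds `mask`. As a small illustration (not part of the original answer), this sketch prints the mask for the sample scores so the row and column roles are visible.

```python
import numpy as np

score = np.array([19., 1.3, 4.8, 6.2, 5.7])

# score has shape (5,) and score.reshape(-1, 1) has shape (5, 1), so the
# comparison broadcasts to shape (5, 5) with mask[i, j] = score[j] > score[i].
mask = score > score.reshape(-1, 1)

print(mask.astype(int))
# [[0 0 0 0 0]
#  [1 0 1 1 1]
#  [1 0 0 1 1]
#  [1 0 0 0 0]
#  [1 0 0 1 0]]
```

Row 0 belongs to the highest score, so it has no True entries. Overwriting every False position with `dists.max()` therefore leaves that row's minimum at the maximum distance, and in the other rows the fill value can never be smaller than a genuine candidate, so it does not change the row minimum.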

The speedup for these small arrays is not that impressive (about 3x on my machine), so I will show a larger set:

dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)

## -- End pasted text --
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop

That is nearly a 24x speedup.
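A timing comparison only means something if both functions return the same values. The sketch below is one way to check that on random input. The seed, the array size, and the way the symmetric matrix is built with plain NumPy are assumptions made for this illustration. The two functions follow the ones defined above.

```python
import numpy as np

def your_method(dists, score):
    # Loop version: one row at a time.
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])[0]
        if len(inx) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist

def vectorized_method_v1(dists, score):
    # Vectorized version: overwrite non-candidates with the maximum distance.
    mask = score > score.reshape(-1, 1)
    dists2 = dists.copy()
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)

# Hypothetical test data: a symmetric 100 x 100 matrix with a zero diagonal.
rng = np.random.default_rng(0)
a = rng.random((100, 100))
dists = (a + a.T) / 2
np.fill_diagonal(dists, 0.0)
score = rng.random(100)

same = np.allclose(your_method(dists, score), vectorized_method_v1(dists, score))
print(same)
```

If this prints `True`, the two implementations agree on that input. It is a spot check, not a proof of equivalence.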

Vectorizing an operation on two related arrays in NumPy
