Question

我得到的数据如下所示。

                [column 1]   [column 2]   [column 3]   [column 4]   [column 5]
[row 1]        (some value)
[row 2]
[row 3]
...
[row 700 000]

和第二个数据集看起来完全相同，但行数较少，大约为4。我想做的是计算数据集1和2中每个数据之间的欧式距离，并找到4的最小值，如下所示：

然后对其余700000 rows个数据重复此操作。我知道不建议迭代numpy数组，因此，有什么方法可以计算出从数据集2馈入数据集1的1行中4个不同行的最小距离吗？

很抱歉，这很令人困惑，但是我要指出的是，我不希望遍历数组，而是想找到一种更好的方法解决这个问题。

最后，我应该以数据集2的4个绿色框中的最佳（最低）值获得700 000行乘1列的数据。

import numpy as np

a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] ] )

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this prints out [0 2 4]

但是，当我尝试输入多个维度时，

a = np.array([ [1,1,1,1] , [2,2,2,2] , [3,3,3,3] ])
b = np.array( [ [1,1,1,1] , [2,2,2,2] ] )

def euc_distance(array1, array2):
    return np.power(np.sum((array1 - array2)**2, axis = 1) , 0.5)
print(euc_distance(a,b))
# this throws back an error as the dimensions are not the same

我正在寻找一种将其分解为3D数组的方法，在此我可以获得[[euc_dist([1,1,1,1],[1,1,1,1]), euc_dist([1,1,1,1],[2,2,2,2])] , ... ]的数组

Answer 1

无法测试，但是假设归一化为正数，这应该可以帮助您。 np.argmax（np.matmul（a，b.T），轴= 1）

我以前的帖子很少阐述。如果仍然存在性能问题，则可以使用以下方法代替您的方法：

b = np.tile(b, (a.shape[0], 1, 1))
a = np.tile(a, (1, 1, b.shape[1])).reshape(b.shape)
absolute_dist = np.sqrt(np.sum(np.square(a - b), axis=2))

它产生的结果完全相同，但是在600,000行上的运行速度比生成器快20倍。

Answer 2

您可以为此使用广播：

a = np.array([
    [1,1,1,1],
    [2,2,2,2],
    [3,3,3,3]
])
b = np.array([
    [1,1,1,1],
    [2,2,2,2]
])

def euc_distance(array1, array2):
    return np.sqrt(np.sum((array1 - array2)**2, axis = -1))

print(euc_distance(a[None, :, :], b[:, None, :]))
# [[0. 2. 4.]
#  [2. 0. 2.]]

比较您大小的数据集的时间：

a = np.random.rand(700000, 4)
b = np.random.rand(4, 4)

c = euc_distance(a[None, :, :], b[:, None, :])
d = np.array([euc_distance(a, val) for val in b])
e = np.array([euc_distance(val, b) for val in a]).T

np.allclose(c, d)
# True
np.allclose(d, e)
# True

%timeit euc_distance(a[None, :, :], b[:, None, :])
# 113 ms ± 4.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.array([euc_distance(a, val) for val in b])
# 115 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.array([euc_distance(val, b) for val in a])
# 7.03 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Answer 3

感谢大家的帮助，但是我认为我已经通过使用简单的列表理解解决了自己的问题。我太复杂了！这样一来，除了迭代每个数据，我基本上减少了超过一半的时间（呈指数增长）。

我所做的如下 c = np.array( [euc_distance(val, b) for val in a]) 谁知道这个问题可以有这么简单的解决方案！

python numpy数组的嵌套循环

3 个答案: