Question

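I'm trying to find a faster way to compute the pairwise Hamming distance between the rows of two numpy arrays. Assume the arrays have shapes A (N1 × D) and B (N2 × D).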

My working attempt so far:

result = np.zeros((A.shape[0], B.shape[0]))
for i in range(A.shape[0]):
    for j in range(B.shape[0]):
        result[i, j] = np.sum(A[i, :] != B[j, :]) #resulting array is of size (1 x D)
return result

This isn't fast enough. I tried using numpy.count_nonzero instead of sum, but it raised the following exception:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().

Edit: I forgot to mention that the arrays contain only 1 and 0 values, in case that changes anything.

My question is: is there a way to make this work? As a bonus question: why does numpy.count_nonzero in my code pass an array to __bool__(), rather than a single value?

Answer 1

Following @Paul's suggestion, I compared the running time of the two approaches on a given numpy.ndarray:

import numpy as np
import time

def binarize(FV):
    return np.where(FV > 0, 1, 0).astype(int)

def hammingDist():
    a, b = -1, 1
    u = (b - a) * np.random.random_sample((3450, 128)) + a
    v = (b - a) * np.random.random_sample((3450, 128)) + a

    b_t = time.time()
    b_u, b_v = binarize(u), binarize(v)
    print('binarization time : {} s'.format(time.time()-b_t))

    h_slow_t = time.time()
    H = np.zeros((b_v.shape[0], b_u.shape[0]))
    for i in range(b_v.shape[0]):
        for j in range(b_u.shape[0]):
            H[i, j] = np.sum(b_v[i, :] != b_u[j, :])
    print('H =\n{}'.format(H))
    print('t: {} s'.format(time.time()-h_slow_t))

    h_f = time.time()
    H_fast = np.count_nonzero(b_v[:, None, :] != b_u, axis=2)
    print('H_fast =\n{}'.format(H_fast))
    print('t: {} s'.format(time.time()-h_f))

if __name__ == "__main__":
    hammingDist()

Result:

binarization time : 0.010922908783 s
H =
[[60. 75. 65. ... 66. 56. 66.]
 [64. 57. 69. ... 78. 64. 58.]
 [62. 63. 65. ... 60. 66. 68.]
 ...
 [60. 63. 69. ... 66. 60. 64.]
 [68. 59. 59. ... 52. 62. 74.]
 [75. 70. 58. ... 59. 65. 65.]]
t: 53.5885431767 s
H_fast =
[[60 75 65 ... 66 56 66]
 [64 57 69 ... 78 64 58]
 [62 63 65 ... 60 66 68]
 ...
 [60 63 69 ... 66 60 64]
 [68 59 59 ... 52 62 74]
 [75 70 58 ... 59 65 65]]
t: 2.6171131134 s
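
Because the binarized arrays contain only 0 and 1, the same counts can also be obtained from two matrix products, which avoids the (N1 × N2 × D) boolean temporary that the broadcasting version builds. The following is only a sketch under that 0/1 assumption; the name hamming_binary is made up for illustration and is not part of the answer above.

```python
import numpy as np

def hamming_binary(A, B):
    """Pairwise Hamming distances between rows of 0/1 arrays A (N1 x D) and B (N2 x D)."""
    # Assumption: every entry is exactly 0 or 1; other values give wrong counts.
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    # First product counts positions where A is 1 and B is 0,
    # second product counts positions where A is 0 and B is 1.
    return A @ (1 - B).T + (1 - A) @ B.T

# Small check against the broadcasting formulation.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 16))
B = rng.integers(0, 2, size=(7, 16))
expected = np.count_nonzero(A[:, None, :] != B[None, :, :], axis=-1)
assert np.array_equal(hamming_binary(A, B), expected)
```

Whether this is actually faster depends on the array sizes and dtype, so it is worth timing on your own data before relying on it.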

Answer 2

You can implement this yourself with NumPy broadcasting, or use scikit-learn. In the timings below, scikit-learn is the fastest.

import numpy as np
import sklearn.neighbors as sn

N1 = 345
N2 = 3450
D = 128

A = np.random.randint(0, 10, size=(N1, D))
B = np.random.randint(0, 10, size=(N2, D))

def slow(A, B):
    result = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            result[i, j] = np.sum(A[i, :] != B[j, :]) #resulting array is of size (1 x D)
    return result

def fast(A, B):
    return np.count_nonzero(A[:, None, :] != B[None, :, :], axis=-1)

def sklearn(A, B):
    return sn.DistanceMetric.get_metric("hamming").pairwise(A, B) * A.shape[-1]

%timeit -r1 -n1 slow(A, B)
# 7.86 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -r1 -n1 fast(A, B)
# 335 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -r1 -n1 sklearn(A, B)
# 51.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

np.allclose(slow(A, B), fast(A, B))  # True
np.allclose(fast(A, B), sklearn(A, B))  # True
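
One thing to keep in mind with the broadcasting version: the comparison A[:, None, :] != B[None, :, :] materializes a boolean array of shape (N1, N2, D) before counting, which can get large (for 3450 × 3450 × 128 that is roughly 1.5 GB). A possible workaround, sketched here with a hypothetical helper name and an arbitrary chunk size, is to process A in blocks of rows:

```python
import numpy as np

def hamming_chunked(A, B, chunk=256):
    """Pairwise Hamming distances (as counts), handling `chunk` rows of A at a time."""
    out = np.empty((A.shape[0], B.shape[0]), dtype=np.intp)
    for start in range(0, A.shape[0], chunk):
        block = A[start:start + chunk]
        # The temporary here has shape (<= chunk, N2, D) instead of (N1, N2, D).
        out[start:start + chunk] = np.count_nonzero(
            block[:, None, :] != B[None, :, :], axis=-1
        )
    return out

# Small check against the unchunked broadcasting version.
rng = np.random.default_rng(0)
A_small = rng.integers(0, 10, size=(10, 8))
B_small = rng.integers(0, 10, size=(6, 8))
full = np.count_nonzero(A_small[:, None, :] != B_small[None, :, :], axis=-1)
assert np.array_equal(hamming_chunked(A_small, B_small, chunk=3), full)
```

The chunk size trades peak memory against the number of Python-level loop iterations; 256 is just a placeholder, not a tuned value.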
