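I am trying to find a faster way to compute the Hamming distance between two numpy arrays. The arrays can be assumed to have shapes A (N1 × D) and B (N2 × D).
My working attempt so far: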
import numpy as np

def hamming_distance(A, B):
    result = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            # the comparison yields a boolean array of shape (D,)
            result[i, j] = np.sum(A[i, :] != B[j, :])
    return result
This is not fast enough. I tried using numpy.count_nonzero instead of sum, but it raised the following exception:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
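Edit: I forgot to mention that the arrays contain only 1 and 0 values, in case that changes anything.
My question is: is it possible to get this to work?
As a bonus question: why does numpy.count_nonzero pass the array to __bool__() in my code, rather than a single value?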
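For reference, a minimal sketch of where this error message comes from. The exact cause in the code above cannot be determined from the snippet alone, so this only shows two things that are known: np.count_nonzero accepts the boolean array produced by an element-wise comparison, and the quoted ValueError is the message NumPy raises whenever an array with more than one element is evaluated as a single truth value. The arrays a and b are made up for illustration.

```python
import numpy as np

# Two small illustrative binary vectors (not from the original post).
a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

# Element-wise comparison yields a boolean array; count_nonzero counts the True entries.
mismatches = np.count_nonzero(a != b)

# Forcing a multi-element array into a single truth value triggers the quoted error.
try:
    bool(a != b)
    message = None
except ValueError as exc:
    message = str(exc)

print(mismatches)
print(message)
```

If the error shows up inside count_nonzero itself, one thing worth checking is whether the input really is a plain numeric or boolean array (for example via its dtype and shape), since that is an assumption this sketch relies on.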
Answer 0 (score: 0)
Following @Paul's suggestion, I compared the running time of the two approaches on a given numpy.ndarray:
import numpy as np
import time

def binarize(FV):
    # Map positive entries to 1 and everything else to 0.
    return np.where(FV > 0, 1, 0).astype(int)

def hammingDist():
    a, b = -1, 1
    # Two random (3450 x 128) matrices with entries in [-1, 1).
    u = (b - a) * np.random.random_sample((3450, 128)) + a
    v = (b - a) * np.random.random_sample((3450, 128)) + a

    b_t = time.time()
    b_u, b_v = binarize(u), binarize(v)
    print('binarization time : {} s'.format(time.time()-b_t))

    # Slow version: explicit double loop over all pairs of rows.
    h_slow_t = time.time()
    H = np.zeros((b_v.shape[0], b_u.shape[0]))
    for i in range(b_v.shape[0]):
        for j in range(b_u.shape[0]):
            H[i, j] = np.sum(b_v[i, :] != b_u[j, :])
    print('H =\n{}'.format(H))
    print('t: {} s'.format(time.time()-h_slow_t))

    # Fast version: broadcast the comparison, then count mismatches along the last axis.
    h_f = time.time()
    H_fast = np.count_nonzero(b_v[:, None, :] != b_u, axis=2)
    print('H_fast =\n{}'.format(H_fast))
    print('t: {} s'.format(time.time()-h_f))

if __name__ == "__main__":
    hammingDist()
Result:
binarization time : 0.010922908783 s
H =
[[60. 75. 65. ... 66. 56. 66.]
[64. 57. 69. ... 78. 64. 58.]
[62. 63. 65. ... 60. 66. 68.]
...
[60. 63. 69. ... 66. 60. 64.]
[68. 59. 59. ... 52. 62. 74.]
[75. 70. 58. ... 59. 65. 65.]]
t: 53.5885431767 s
H_fast =
[[60 75 65 ... 66 56 66]
[64 57 69 ... 78 64 58]
[62 63 65 ... 60 66 68]
...
[60 63 69 ... 66 60 64]
[68 59 59 ... 52 62 74]
[75 70 58 ... 59 65 65]]
t: 2.6171131134 s
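To make the broadcasting step easier to follow, here is a small sketch on hand-checkable inputs. The function names hamming_broadcast and hamming_chunked and the chunk parameter are hypothetical, introduced only for this illustration. The chunked variant is an optional addition, not part of the timing above: the broadcast comparison materializes an (N1, N2, D) boolean array, which for two 3450 × 128 inputs is roughly 1.5 GB, so processing A in row blocks is one way to bound that temporary.

```python
import numpy as np

def hamming_broadcast(A, B):
    # A[:, None, :] has shape (N1, 1, D) and B[None, :, :] has shape (1, N2, D).
    # Broadcasting compares every row of A with every row of B: shape (N1, N2, D).
    diff = A[:, None, :] != B[None, :, :]
    # Counting the True entries along the last axis gives the (N1, N2) distances.
    return np.count_nonzero(diff, axis=-1)

def hamming_chunked(A, B, chunk=256):
    # Same result, but the temporary boolean array is at most (chunk, N2, D).
    out = np.empty((A.shape[0], B.shape[0]), dtype=np.intp)
    for start in range(0, A.shape[0], chunk):
        out[start:start + chunk] = hamming_broadcast(A[start:start + chunk], B)
    return out

# Tiny illustrative inputs (not from the original answer).
A = np.array([[0, 1, 1, 0],
              [1, 1, 0, 0]])
B = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 1, 1]])

H = hamming_broadcast(A, B)
print(H)
# [[0 4 2]
#  [2 2 2]]
```

The chunk size trades memory for loop overhead; a suitable value depends on N2, D and the available RAM.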
Answer 1 (score: 0)
You can implement this yourself with NumPy broadcasting, or use scikit-learn. In the timings below, scikit-learn is the fastest.
import numpy as np
# Note: newer scikit-learn releases expose DistanceMetric under sklearn.metrics,
# so this import may need adjusting depending on the installed version.
import sklearn.neighbors as sn

N1 = 345
N2 = 3450
D = 128
A = np.random.randint(0, 10, size=(N1, D))
B = np.random.randint(0, 10, size=(N2, D))

def slow(A, B):
    result = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            # the comparison yields a boolean array of shape (D,)
            result[i, j] = np.sum(A[i, :] != B[j, :])
    return result

def fast(A, B):
    # Broadcast to (N1, N2, D), then count mismatches along the last axis.
    return np.count_nonzero(A[:, None, :] != B[None, :, :], axis=-1)

def sklearn(A, B):
    # The "hamming" metric returns the fraction of differing positions,
    # so multiply by D to get the count.
    return sn.DistanceMetric.get_metric("hamming").pairwise(A, B) * A.shape[-1]

# The %timeit lines are IPython/Jupyter magics.
%timeit -r1 -n1 slow(A, B)
# 7.86 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -r1 -n1 fast(A, B)
# 335 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%timeit -r1 -n1 sklearn(A, B)
# 51.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

np.allclose(slow(A, B), fast(A, B)) # True
np.allclose(fast(A, B), sklearn(A, B)) # True
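Since the question states that the arrays contain only 0 and 1, a different technique is also possible: two positions differ exactly when one is 1 and the other is 0, so the mismatch count can be written as two matrix products. The following is a sketch under that assumption, not something timed in this answer, and it does not apply to the randint(0, 10) test data used above. The name hamming_binary is hypothetical.

```python
import numpy as np

def hamming_binary(A, B):
    # Valid only when every entry of A and B is 0 or 1 (assumption from the question).
    A = A.astype(np.int32)
    B = B.astype(np.int32)
    # A @ (1 - B).T counts positions where A is 1 and B is 0;
    # (1 - A) @ B.T counts positions where A is 0 and B is 1.
    return A @ (1 - B).T + (1 - A) @ B.T

# Small random binary inputs for a self-check against the broadcasting version.
rng = np.random.default_rng(0)
A_bin = rng.integers(0, 2, size=(5, 16))
B_bin = rng.integers(0, 2, size=(7, 16))

reference = np.count_nonzero(A_bin[:, None, :] != B_bin[None, :, :], axis=-1)
result = hamming_binary(A_bin, B_bin)
print(np.array_equal(result, reference))
```

This avoids the (N1, N2, D) temporary entirely. Whether it is actually faster than the approaches timed above likely depends on the dtype and the linear algebra backend, so it is worth measuring on the real data.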