Fast Hamming distance computation between binary numpy arrays

Date: 2015-09-23 02:51:59

Tags: python arrays numpy cython hamming-distance

I have two numpy arrays of the same length containing binary values

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

I want to compute the Hamming distance between them as fast as possible, since I have millions of such distance computations to perform.

Here is a simple but slow option (taken from Wikipedia):

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

I have come up with faster options, inspired by some answers here on Stack Overflow.

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

I am wondering whether there is an even faster way to compute this, possibly using cython?

5 Answers:

Answer 0 (score: 14):

There is a ready-made numpy function that beats len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)
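
As a quick sanity check on the arrays from the question (a minimal usage sketch, reusing a and b as defined above):

np.count_nonzero(a != b)   # -> 7, the Hamming distance between a and b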

Answer 1 (score: 4):

Compared to 1.07 µs for np.count_nonzero(a != b) on my platform, gmpy2.hamdist gets it down to about 143 ns once each array is converted to an mpz (multi-precision integer):

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

Following a tip from @casevh, the conversion from an array of 1s and 0s to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))), 1).

# gmpy2.pack reverses bit order but that does not affect
# hamdist since both its arguments are reversed
ampz = pack(list(a),1) # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

For a relative comparison, on my platform:

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop   

Answer 2 (score: 4):

Using pythran can bring an extra benefit here:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

For reference (without pythran):

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

After compiling with pythran:

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

That's roughly a 6x speedup over the numpy implementation, as pythran skips the creation of an intermediate array when evaluating the element-wise comparison.

I also measured:

#pythran export hamm(int[], int[])
from numpy import count_nonzero
def hamm(a,b):
    return count_nonzero(a != b)

I get 3.11 usec per loop for the Python version and 0.427 usec per loop for the Pythran version.

Disclaimer: I am one of the Pythran developers.

Answer 3 (score: 0):

For strings this works faster:

def Hamm(a, b):
    c = 0
    for i in range(a.shape[0]):
        if a[i] != b[i]:
            c += 1
    return c

Answer 4 (score: 0):

I would suggest converting the numpy bit arrays to numpy uint8 arrays using np.packbits.

Have a look at scipy's spatial.distance.hamming: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html
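
For reference, a minimal sketch of the scipy route on the arrays from the question; note that scipy.spatial.distance.hamming returns the fraction of differing positions, so multiply by the length to recover the count:

from scipy.spatial.distance import hamming

hamming(a, b) * len(a)   # -> 7.0, same count as np.count_nonzero(a != b)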

Otherwise, here is a small snippet that only needs numpy, inspired by Fast way of counting non-zero bits in positive integer:

bit_counts = np.array([int(bin(x).count("1")) for x in range(256)]).astype(np.uint8)
def hamming_dist(a,b,axis=None):
    return np.sum(bit_counts[np.bitwise_xor(a,b)],axis=axis)
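
A minimal sketch of how the np.packbits suggestion above could be combined with this helper on the original a and b (np.packbits zero-pads both arrays to a multiple of 8 bits, which does not change the XOR-based count):

a_packed = np.packbits(a)   # 21 bits -> 3 uint8 bytes, zero-padded
b_packed = np.packbits(b)
hamming_dist(a_packed, b_packed)   # -> 7, matches np.count_nonzero(a != b)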

With axis=-1, this allows taking the Hamming distance between one entry and a large array; e.g.:

inp = np.uint8(np.random.random((512,8))*255) # 512 entries of 8 bytes each
hd = hamming_dist(inp, inp[123], axis=-1) # results in 512 Hamming distances to entry 123
idx_best = np.argmin(hd)    # should point to entry 123 itself
hd[123] = 255 # mask out the identity
idx_nearest = np.argmin(hd)    # should point to the entry with the shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd)) # distribution of Hamming distances; for me this ranged from 18 to 44 bits, with a maximum at 31