Question

My data has the format 0010011 or [False, False, True, False, False, True, True]. There are about 10 million entries in this format, each with a thousand values rather than seven.

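In my use case, I receive a new entry in the same form. I then want to get the indices of the 100 most similar entries. Here, "most similar" is defined as having the most intersecting 1s or Trues, respectively. For example, the two entries 00011 and 00010 have one intersecting 1.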

Currently I do the comparison like this:

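# complete_list holds the stored entries as integers; binary_2 is the new entry.
similarities = []
for idx, binary_1 in enumerate(complete_list):
    # Count the 1 bits that both entries have in common.
    similarities.append((idx, np.binary_repr(binary_1 & binary_2).count('1')))
similarities.sort(key=lambda t: t[1], reverse=True)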

For 10,000 random test entries this takes 0.2 seconds. Is there a faster way?

Answer 1

Update: found a 3x speedup.

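Here is an approach that uses numpy on packed bits to save memory. To convert between this format and individual 0s and 1s of type uint8, numpy provides the functions packbits and unpackbits.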
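
As a minimal sketch of that conversion (the array sizes and variable names here are made up for illustration and are not part of the answer's code), packing and unpacking a small bit matrix might look like this:

```python
import numpy as np

# Hypothetical small example: 3 entries of 20 bits each (the real data has 1000 bits).
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(3, 20), dtype=np.uint8)

# packbits stores 8 bits per byte; the last byte of each row is zero-padded.
packed = np.packbits(bits, axis=1)

# unpackbits restores one uint8 per bit; slice off the padding bits again.
restored = np.unpackbits(packed, axis=1)[:, :bits.shape[1]]

print(packed.shape, packed.dtype)
print(np.array_equal(bits, restored))
```

Note that the code further down sizes each row as (m + 63) // 64 * 8 bytes, a multiple of 8, which is what allows a row to be viewed as uint64.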

The code below precomputes the bit count (the sum of the bits) for all 2^16 patterns that a 16-bit chunk can take.

(The old version looked up pairs of bytes from the data and the template.)

We use view casting to uint64 to perform the bitwise intersection on 64-bit chunks, and then cast back to uint16 for the lookup.

To find the n closest we use argpartition (O(N)) instead of argsort (O(N log N)).

import numpy as np

n, m = 1_000_000, 1_000

data = np.random.randint(0, 256, (n, (m + 63) // 64 * 8), dtype=np.uint8)
test = np.random.randint(0, 256, ((m + 63) // 64 * 8,), dtype=np.uint8)

def create_lookup_1d():
    x, p = np.ogrid[:1<<16, :16]
    p = 1 << p
    return np.count_nonzero(x & p, axis=1)

lookup_1d = create_lookup_1d()

def find_closest(data, test, n):
    similarities = lookup_1d[(data.view(np.uint64) & test.view(np.uint64))
                             .view(np.uint16)].sum(axis=1)
    top_n = np.argpartition(similarities, len(data)-n)[-n:]
    return top_n, similarities[top_n]

# below is obsolete older version

def create_lookup_2d():
    x, y, p = np.ogrid[:256, :256, :8]
    p = 1 << p
    return np.count_nonzero(x & y & p, axis=2)

lookup_2d = create_lookup_2d()

def find_closest_old(data, test, n):
    similarities = lookup_2d[data, test].sum(axis=1)
    top_n = np.argpartition(similarities, len(data)-n)[-n:]
    return top_n, similarities[top_n]
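
One way to convince yourself that the lookup-table count is right is to compare it with a brute-force count on a small sample. The following is only a sketch under assumed sizes (1,000 rows instead of a million; the names fast and slow are illustrative), not part of the timed code:

```python
import numpy as np

# Small hypothetical sizes so the brute-force check stays cheap.
n, m = 1_000, 1_000
row_bytes = (m + 63) // 64 * 8          # 128 bytes per row, as in the code above
rng = np.random.default_rng(0)
data = rng.integers(0, 256, (n, row_bytes), dtype=np.uint8)
test = rng.integers(0, 256, (row_bytes,), dtype=np.uint8)

# Bit count of every 16-bit value, built the same way as lookup_1d.
x, p = np.ogrid[:1 << 16, :16]
lookup = np.count_nonzero(x & (1 << p), axis=1)

# Lookup-based count: AND on 64-bit words, then view the result as 16-bit chunks.
both = data.view(np.uint64) & test.view(np.uint64)
fast = lookup[both.view(np.uint16)].sum(axis=1)

# Reference count: unpack every bit of the byte-wise AND and sum the bits.
slow = np.unpackbits(data & test, axis=1).sum(axis=1, dtype=np.int64)

print(np.array_equal(fast, slow))
```

The reference version materialises one byte per bit, so it is only meant as a check on small inputs.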

Demo (one million entries of a thousand bits each, finding the best hundred):

>>> import time
>>> t = time.perf_counter(); find_closest(data, test, 100); t = time.perf_counter() - t
(array([913659, 727762, 819589, 810536, 671392, 573459, 197431, 642848,
         8792, 710169, 656667, 692412,  23355, 695527, 276548, 756096,
       286481, 931702, 301590, 309232, 223684, 838507, 888607, 205403,
       988198, 600239, 256834, 876452, 793813,  46501, 559521, 697295,
       948215, 247923, 503962, 808630, 515953,  22821, 614888, 487735,
       443673, 174083, 906902, 613131, 546603, 147657, 332898, 381553,
       808760, 383885, 107547,  85942,  20966, 880034, 522925,  18833,
       547674, 901503, 702596, 773050, 734658, 383581, 973043, 387713,
       645705,  27045, 230226,  77403, 906601, 507193, 828268, 175863,
       708155, 130634, 486701, 534719, 643487, 940071, 694781, 470385,
       954446, 134532, 748100, 110987, 417001, 871320, 993915, 489725,
         6509,  38345, 705618, 637435, 311252, 347282, 536091, 663643,
       830238, 376695, 896090, 823631]), array([305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305, 305,
       305, 306, 305, 306, 306, 305, 305, 305, 305, 305, 306, 305, 306,
       305, 306, 314, 308, 307, 309, 306, 308, 307, 308, 307, 312, 308,
       306, 316, 306, 306, 307, 307, 308, 309, 308, 307, 309, 309, 311,
       309, 310, 310, 307, 307, 306, 307, 306, 307, 309, 308, 309, 308,
       306, 307, 309, 306, 306, 306, 311, 306, 308, 307, 306, 306, 307,
       308, 306, 307, 310, 307, 306, 307, 309, 306, 306, 310, 313, 306,
       306, 307, 310, 306, 307, 307, 309, 311, 307]))
>>> t
0.4612512579988106
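
As the output shows, the similarities of the returned indices are not in order: argpartition only guarantees that the n largest values end up in the last n positions. If a ranked result is needed, sorting just those n entries afterwards is cheap. A minimal sketch, using made-up random scores in place of the real similarities:

```python
import numpy as np

rng = np.random.default_rng(0)
similarities = rng.integers(0, 1000, size=10_000)   # hypothetical scores
n = 100

# argpartition places the n largest values in the last n positions,
# but leaves their order within that slice unspecified.
top_n = np.argpartition(similarities, len(similarities) - n)[-n:]

# Sort only those n indices, best match first.
top_n_sorted = top_n[np.argsort(similarities[top_n])[::-1]]

# Cross-check against a full sort: the n largest values must match.
full = np.sort(similarities)[::-1][:n]
print(np.array_equal(similarities[top_n_sorted], full))
```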

Answer 2

Using broadcasting may help. For example,

import numpy as np

complete_list = np.random.randint(0, 2, (10000, 10)).astype(bool)
binary_2 = np.random.randint(0, 2, 10).astype(bool)

similarities = np.sum(complete_list & binary_2, axis=1)
idx = np.argsort(similarities)

print("Seed", binary_2)
print("Result", complete_list[idx[-1]])
print("Similarity", similarities[idx[-1]])

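I couldn't get your example to run (possibly different Python/library versions?), so I haven't run any benchmarks comparing the two approaches. Our machines will of course differ, but the above takes about half a millisecond on mine.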

Note that I have used & rather than | based on your description of the intended logic.
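
To illustrate the difference, here is a small sketch using the two entries from the question, 00011 and 00010 (the variable names are only for illustration):

```python
import numpy as np

# The two entries from the question, 00011 and 00010, as boolean arrays.
a = np.array([0, 0, 0, 1, 1], dtype=bool)
b = np.array([0, 0, 0, 1, 0], dtype=bool)

shared = np.sum(a & b)   # positions where both entries are True
either = np.sum(a | b)   # positions where at least one entry is True

print(shared, either)    # 1 2
```

Only & yields the single intersecting 1 that the question describes.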

Fastest way to compare binary data?