Question

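I am using Bloom filters to check for duplicated data in a set. However, I need to combine the results of two sets of data into a single filter, so that I can check for duplicates across both sets. I devised a function in pseudo-Python to perform this task: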

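def combine(a: BloomFilter, b: BloomFilter) -> BloomFilter:
    # Filters can only be merged if they share size and hash count.
    assert a.length == b.length
    assert a.hashes == b.hashes

    c = BloomFilter(length=a.length, hashes=a.hashes)
    c.attempts = a.attempts + b.attempts
    c.bits = a.bits | b.bits  # bits are assumed to be held as a Python int

    # Determining the amount of items
    mask = (1 << a.length) - 1  # keeps the complements within the filter width
    a_and_b = bin(a.bits & b.bits).count("1")
    a_not_b = bin(a.bits & ~b.bits & mask).count("1")
    not_a_b = bin(~a.bits & b.bits & mask).count("1")
    neither = bin(~(a.bits | b.bits) & mask).count("1")  # not used below
    c.item_count = (a_not_b / a.length * a.item_count
                    + not_a_b / b.length * b.item_count
                    + a_and_b / c.length * min(a.item_count, b.item_count))

    return c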

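Does this sound correct? I am having considerable internal debate over whether it is even possible to do what I intend, since much of the information about the source data is lost (which is the point of a Bloom filter).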

Answer 1

You can derive a formula for estimating the number of items in a Bloom filter:

c = log(z / N) / (h * log(1 - 1 / N))

N: Number of bits in the bit vector
h: Number of hashes
z: Number of zero bits in the bit vector

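This gives a fairly accurate estimate of the number of items in the Bloom filter. You can then estimate each set's contribution by simple subtraction: for example, the estimate for the combined filter minus the estimate for one input approximates the items that only the other input contributed.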
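As a rough illustration of the formula, here is a minimal sketch that assumes the bit vector is held as a Python int and that N is greater than 1. The function names are hypothetical, and `estimate_overlap` is only one possible reading of "simple subtraction" (inclusion-exclusion over the OR-ed filter), not something the answer spells out.

```python
import math

def estimate_item_count(bits: int, n_bits: int, n_hashes: int) -> float:
    """Estimate how many items were inserted, from the count of zero bits."""
    zero_bits = n_bits - bin(bits).count("1")
    if zero_bits == 0:
        return math.inf  # a saturated filter carries no count information
    # c = log(z / N) / (h * log(1 - 1 / N))
    return math.log(zero_bits / n_bits) / (n_hashes * math.log(1 - 1 / n_bits))

def estimate_overlap(a_bits: int, b_bits: int, n_bits: int, n_hashes: int) -> float:
    """Assumed reading of the subtraction step: |A and B| ~ |A| + |B| - |A or B|."""
    def est(bits: int) -> float:
        return estimate_item_count(bits, n_bits, n_hashes)
    return est(a_bits) + est(b_bits) - est(a_bits | b_bits)
```

Because these are statistical estimates, the overlap can come out slightly negative or fractional for small filters; it is best treated as an approximation rather than a count.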

Answer 2

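It's possible... sort of...

Let's say set A contains apples and oranges.

Let's say set B contains peas and carrots.

Construct a simple 16-bit Bloom filter as an example, with CRC32 as the hash: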

crc32(apples) = 0x70CCB02F

crc32(oranges) = 0x45CDF3B4

crc32(peas) = 0xB18D0C2B

crc32(carrots) = 0x676A9E28

For both sets (A, B):

Start with an empty Bloom filter (BF) (say, 16 bits):

BFA = BFB = 0000 0000 0000 0000

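Then break the hash up into chunks of some bit length (we will use 4 here) and use each chunk as a bit index. With that, we can add apples to the BF. For example: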

Get the BF index list for apples by splitting up the hash:

0x70CCB02F = 0111 0000 1100 1100 1011 0000 0010 1111
             7    0    C    C    B    0    2    F
----------------------------------------------------

Add apples to BFA by setting BF bit indexes [7, 0, 12, 12, 11, 0, 2, 15]

                                  (set the index bit of an empty BF to 1; bit 15 is leftmost)
Apples =      1001 1000 1000 0101 (<- see indexes 0,2,7,11,12,15 are set)
BFA =         0000 0000 0000 0000 (or operation adds that item to the BF)
=================================
Updated BFA = 1001 1000 1000 0101

Add oranges to the BF in the same way:

0x45CDF3B4 = 0100 0101 1100 1101 1111 0011 1011 0100
              4    5    12   13   15    3   11   4
----------------------------------------------------
Add oranges to BFA by setting BF bit indexes [4, 5, 12, 13, 15, 3, 11, 4]

Oranges =      1011 1000 0011 1000 
BFA =          1001 1000 1000 0101  (or operation)
================================
Updated BFA =  1011 1000 1011 1101

So now apples and oranges have been inserted into BFA, with a final value of 1011 1000 1011 1101.
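The insertion steps above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, the filter is held as a plain int with bit 15 printed first, and the hash values are copied from the answer rather than recomputed.

```python
def nibble_indexes(hash32: int) -> list:
    """Split a 32-bit hash into eight 4-bit indexes, most significant first."""
    return [(hash32 >> shift) & 0xF for shift in range(28, -1, -4)]

def item_mask(hash32: int) -> int:
    """16-bit mask with one bit set per index (duplicate indexes collapse)."""
    mask = 0
    for index in nibble_indexes(hash32):
        mask |= 1 << index
    return mask

def show(bits: int) -> str:
    """Format a 16-bit value as four groups of four bits, bit 15 first."""
    text = format(bits, "016b")
    return " ".join(text[i:i + 4] for i in range(0, 16, 4))

# Hash values copied from the worked example, not recomputed here.
APPLES = 0x70CCB02F
ORANGES = 0x45CDF3B4

bfa = 0
for item_hash in (APPLES, ORANGES):
    bfa |= item_mask(item_hash)  # OR-ing a mask in adds that item

print(nibble_indexes(APPLES))    # [7, 0, 12, 12, 11, 0, 2, 15]
print(show(item_mask(APPLES)))   # 1001 1000 1000 0101
print(show(bfa))                 # 1011 1000 1011 1101
```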

Do the same for BFB:

crc32(peas) = 0xB18D0C2B becomes =>
set [11,2,12,0,13,1,8] in BFB

0011 1001 0000 0111 = BF(peas)

crc32(carrots) = 0x676A9E28 becomes =>
set [8,2,14,9,10,6,7] in BFB

0100 0111 1100 0100 = BF(carrots)

so BFB =
0011 1001 0000 0111  BF(peas)
0100 0111 1100 0100  BF(carrots)
===================  ('add' them to BFB via logical or op)
0111 1111 1100 0111

You can now search B for A's entries in a loop, and vice versa:

Does B contain "oranges"? =>

 1011 1000 0011 1000 (Oranges BF representation)
 0111 1111 1100 0111 (BFB)
=====================     (and operation)
 0011 1000 0000 0000

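Because this result (0011 1000 0000 0000) does not match the original BF representation of oranges, you can be certain that B does not contain any oranges.

... (do the same for the remaining items)

Carrying on this way, B contains none of A's items; for instance, B does not contain any apples either.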
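The membership check above (AND the item's mask with the filter and compare against the mask) might look like this in Python. It is a sketch under the same assumptions as the worked example: 16-bit filters held as ints, 4-bit chunks of the hash as indexes, and hash values copied from the answer; `item_mask` and `might_contain` are illustrative names.

```python
def item_mask(hash32: int) -> int:
    """16-bit mask with one bit set for each 4-bit chunk of the hash."""
    mask = 0
    for shift in range(0, 32, 4):
        mask |= 1 << ((hash32 >> shift) & 0xF)
    return mask

def might_contain(bf: int, hash32: int) -> bool:
    """False means definitely absent; True means possibly present."""
    mask = item_mask(hash32)
    return (bf & mask) == mask

# Hash values copied from the worked example, not recomputed here.
HASHES = {
    "apples": 0x70CCB02F,
    "oranges": 0x45CDF3B4,
    "peas": 0xB18D0C2B,
    "carrots": 0x676A9E28,
}

bfa = item_mask(HASHES["apples"]) | item_mask(HASHES["oranges"])
bfb = item_mask(HASHES["peas"]) | item_mask(HASHES["carrots"])

# Search B for A's entries, and vice versa.
for name in ("apples", "oranges"):
    print(name, "in B?", might_contain(bfb, HASHES[name]))  # False for both
for name in ("peas", "carrots"):
    print(name, "in A?", might_contain(bfa, HASHES[name]))  # False for both
```

A `True` result would only mean "possibly present", since different items can set the same bits.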

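I don't think this is quite what you were asking, though. It looks like you could compute the difference between the two BFs, which is more to your point. You could do an xor operation, which gives you a "single" array containing the differences: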

0111 1111 1100 0111 (BFB)
1011 1000 1011 1101 (BFA)
========================
1100 0111 0111 1010 (BFA xor BFB) == (items in B not in A, and items in A not in B)

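What this means is that, with this single BF, you can detect the non-existence of an item 100% of the time, just not the existence of an item 100% of the time.

The way you would use it is as follows (checking whether peas is missing from A):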

 1100 0111 0111 1010 (BFA xor BFB)
 0011 1001 0000 0111 (Peas)
============================== (And operation)
 0000 0001 0000 0010 (non-zero)

Because (BFA xor BFB) & (Peas) != 0, you know that one of the sets does not contain 'peas'...
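For reference, the xor check can be reproduced with plain integer operations. This is a sketch of the answer's arithmetic only, again with the filters held as Python ints and the hash values copied from the worked example; `item_mask` is an illustrative helper.

```python
def item_mask(hash32: int) -> int:
    """16-bit mask with one bit set for each 4-bit chunk of the hash."""
    mask = 0
    for shift in range(0, 32, 4):
        mask |= 1 << ((hash32 >> shift) & 0xF)
    return mask

# Hash values copied from the worked example, not recomputed here.
bfa = item_mask(0x70CCB02F) | item_mask(0x45CDF3B4)  # apples, oranges
bfb = item_mask(0xB18D0C2B) | item_mask(0x676A9E28)  # peas, carrots
peas = item_mask(0xB18D0C2B)

diff = bfa ^ bfb      # bits set in exactly one of the two filters
probe = diff & peas   # non-zero: some of the item's bits are in only one filter

print(format(diff, "016b"))   # 1100011101111010
print(format(probe, "016b"))  # 0000000100000010
print(probe != 0)             # True
```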

Again, you would be testing item by item. Maybe you could aggregate, but that is probably not a good idea...

Hope this helps!
