Bloom filter: getting a higher error rate than expected

Asked: 2018-03-18 23:27:14

Tags: java filter logic bloom-filter

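I created a Bloom filter using murmur3, blake2b, and the Kirsch-Mitzenmacher optimization, as described in the second answer to this question: Which hash functions to use in a Bloom filter

However, when I test it, the Bloom filter's error rate is consistently higher than I expect.

Here is the code I use to build the Bloom filter: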

public class BloomFilter {
private BitSet filter;
private int size;
private int hfNum;
private int prime;
private double fp = 232000; //One false positive every fp items

public BloomFilter(int count) {
    size = (int)Math.ceil(Math.ceil(((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
    hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
    //size = (int)Math.ceil((hfNum * count) / Math.log(2.0));
    filter = new BitSet(size);

    System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
}

public BloomFilter extraSecure(int count) {
    return new BloomFilter(count, true);
}

private BloomFilter(int count, boolean x) {
    size = (int)Math.ceil((((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
    hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
    prime = findPrime();
    size = prime * hfNum;
    filter = new BitSet(prime * hfNum);

    System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
}

public void add(String in) {
    filter.set(getMurmur(in), true);
    filter.set(getBlake(in), true);

    if(this.hfNum > 2) {
        for(int i = 3; i <= (hfNum); i++) {
            filter.set(getHash(in, i));
        }
    }
}

public boolean check(String in) {
    if(!filter.get(getMurmur(in)) || !filter.get(getBlake(in))) {
        return false;
    }

    for(int i = 3; i <= hfNum; i++) {
        if(!filter.get(getHash(in, i))) {
            return false;
        }
    }

    return true;
}

private int getMurmur(String in) {
    int temp = murmur(in) % (size);

    if(temp < 0) {
        temp = temp * -1;
    }

    return temp;
}

private int getBlake(String in) {
    int temp = new BigInteger(blake256(in), 16).intValue() % (size);

    if(temp < 0) {
        temp = temp * -1;
    }

    return temp;
}

private int getHash(String in, int i) {
    int temp = ((getMurmur(in)) + (i * getBlake(in))) % size;
    return temp;
}

private int findPrime() {
    int temp;

    int test = size;
    while((test * hfNum) > size ) {
        temp = test - 1;
        while(!isPrime(temp)) {
            temp--;
        }
        test = temp;
    }

    if((test * hfNum) < this.size) {
        test++;
        while(!isPrime(test)) {
            test++;
        }
    }

    return test;
}

private static boolean isPrime(int num) {
    if (num < 2) return false;
    if (num == 2) return true;
    if (num % 2 == 0) return false;
    for (int i = 3; i * i <= num; i += 2)
        if (num % i == 0) return false;
    return true;
}

@Override
public String toString() {
    final StringBuilder buffer = new StringBuilder(size);
    IntStream.range(0, size).mapToObj(i -> filter.get(i) ? '1' : '0').forEach(buffer::append);
    return buffer.toString();
}

}
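
For reference, the two constructors size the filter with the standard formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions, with p = 1/fp. The sketch below evaluates them for the parameters used in the test (n = 4000, fp = 232000). The class name `BloomSizing` and its method names are hypothetical and not part of the code above. Unlike the constructors, it divides m by n in floating point rather than with integer division.

```java
final class BloomSizing {
    // Number of bits m for n items and false-positive probability p:
    // m = -n * ln(p) / (ln 2)^2, rounded up.
    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions k = (m / n) * ln 2, rounded up.
    // The cast to double avoids the integer division of size / count.
    static int optimalHashes(int m, int n) {
        return (int) Math.ceil(((double) m / n) * Math.log(2));
    }

    public static void main(String[] args) {
        int n = 4000;
        double p = 1.0 / 232000;
        int m = optimalBits(n, p);
        int k = optimalHashes(m, n);
        System.out.println("m = " + m + ", k = " + k);
    }
}
```

For these particular parameters the integer division and the floating-point division round up to the same k, so that detail on its own would not seem to explain the discrepancy.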

以下是我用来测试它的代码:

public static void main(String[] args) throws Exception {
    int z = 0;
    int times = 10;
    while(z < times) {
        z++;
        System.out.print("\r");
        System.out.print(z);


        BloomFilter test = new BloomFilter(4000);

        SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
        for(int i = 0; i < 4000; i++) {
            test.add(blake256(Integer.toString(random.nextInt())));
        }

        int temp = 0;
        int count = 1;
        while(!test.check(blake512(Integer.toString(temp)))) {
            temp = random.nextInt();
            count++;
        }

        if(z == (times)) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count)).getBytes(), StandardOpenOption.APPEND);
        }else {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes(), StandardOpenOption.APPEND);
        }

        if(z == 1) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes());
        }

    }
}

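I expected to get a value relatively close to the fp variable in the Bloom filter class, but I frequently get about half of it. Does anyone know what I am doing wrong, or whether this is normal?

Edit: To show what I mean by a high error rate: when I run the code on a filter initialized with count 4000 and fp 232000, this is the number of values the filter had to check before it found a false positive: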

158852,354114,48563,76875,156033,82506,61294,2529,82008,32624

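This was generated with the filter initialized through the extraSecure() method, and the test was repeated 10 times to produce these 10 numbers. All but one of the runs needed fewer than 232000 generated values to find a false positive. The average of the 10 is about 105540, and that is typical no matter how many times I repeat the test.

Looking at the individual values, the fact that it found a false positive after generating only 2529 numbers is a big problem for me, considering that I am adding 4000 data points.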
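
When reading these counts, it may help to keep in mind that if each query independently produces a false positive with probability p, the number of queries until the first one follows a geometric distribution, so single runs scatter widely around the mean 1/p. The sketch below, which assumes that independence (the test does not verify it), computes the chance of a false positive within a given number of queries and the median waiting time. The class `FirstFalsePositive` and its methods are hypothetical names introduced only for illustration.

```java
final class FirstFalsePositive {
    // P(first false positive occurs within t queries) = 1 - (1 - p)^t
    static double probWithin(double p, long t) {
        return 1.0 - Math.pow(1.0 - p, t);
    }

    // Median waiting time: smallest t with P(within t) >= 0.5
    static long median(double p) {
        return (long) Math.ceil(Math.log(0.5) / Math.log1p(-p));
    }

    public static void main(String[] args) {
        double p = 1.0 / 232000;   // target false-positive probability
        System.out.printf("P(within 2529 queries) = %.4f%n", probWithin(p, 2529));
        System.out.println("median queries until first false positive = " + median(p));
    }
}
```

Under this model, even a filter that meets its target would be expected to hit a false positive in fewer than 232000 queries in more than half of the runs, and a run as short as 2529 has a probability of about 1%. The mean over many runs should still approach 232000, though, so an average that stays near 105540 is not accounted for by scatter alone.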

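2 Answers:

Answer 0: (score: 0)

It turns out the problem is that the answer on the other page is not entirely correct, and neither is the comment below it.

The comment says:

  In the paper, hash_i = hash1 + i x hash2 % p, where p is a prime, hash1 and hash2 are in the range [0, p-1], and the bitset consists of k * p bits.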

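However, a look at the paper reveals that although all hash values are taken mod p, each hash function is assigned its own subset of the total bitset. As I understand it, hash1 mod p determines a position in indices 0 to p, hash2 mod p determines a position in indices p to 2 * p, and so on, up to the value of k chosen for the bitset.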
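
That reading can be sketched as follows. This is only an illustration of the partitioned layout as I understand it, not code taken from the paper. The names `PartitionedIndex` and `index` are hypothetical, hash functions are numbered i = 0 .. k-1, and `h1` and `h2` are assumed to be already reduced to the range [0, p-1].

```java
final class PartitionedIndex {
    // Bit position for the i-th hash function in a bitset of k * p bits.
    // Slice i covers positions [i * p, (i + 1) * p), so two different
    // hash functions can never set the same bit.
    static int index(int h1, int h2, int i, int p) {
        // g_i = (h1 + i * h2) mod p; non-negative because h1, h2, i >= 0.
        long g = ((long) h1 + (long) i * h2) % p;
        return (int) ((long) i * p + g);
    }

    public static void main(String[] args) {
        int p = 7;          // small prime, for illustration only
        int h1 = 5, h2 = 3; // assumed to be already reduced mod p
        for (int i = 0; i < 3; i++) {
            System.out.println("i = " + i + " -> bit " + index(h1, h2, i, p));
        }
    }
}
```

Compared with the getHash method in the question, which reduces every value mod size and lets all hash functions share the whole bitset, this version reduces mod p and then offsets by i * p.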

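I'm not 100% sure whether adding this will fix my code, but it's worth a try. I'll update if it works.

Update: It didn't help. I'm investigating what else might be causing the problem.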

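Answer 1: (score: 0)

I'm afraid I don't know where the bug is, but you can simplify a lot. You don't actually need a prime, and you don't need SecureRandom, BigInteger, or modulo. All you need is a good 64-bit hash (seeded if possible, for example Murmur):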

long bits = (long) (entryCount * bitsPerKey);
int arraySize = (int) ((bits + 63) / 64);
long[] data = new long[arraySize];
int k = getBestK(bitsPerKey);

void add(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        data[reduce(a, arraySize)] |= 1L << a;
        a += b;
    }
}

boolean mayContain(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        if ((data[reduce(a, arraySize)] & 1L << a) == 0) {
            return false;
        }
        a += b;
    }
    return true;
}

static int reduce(int hash, int n) {
    // http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
    return (int) (((hash & 0xffffffffL) * n) >>> 32);
}

static int getBestK(double bitsPerKey) {
    return Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
}
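
The snippet above leaves `hash64`, `seed`, `entryCount`, and `bitsPerKey` undefined. Below is a self-contained sketch that fills those in under stated assumptions. The class name `SimpleBloom` is hypothetical, and `hash64` is a stand-in built from the MurmurHash3 64-bit finalizer, which may not be the hash the answer had in mind. The bit within each word is chosen by the low 6 bits of `a`, as in `mayContain` above.

```java
final class SimpleBloom {
    private final long[] data;
    private final int k;
    private final long seed;

    // entryCount must be positive so that the backing array is non-empty.
    SimpleBloom(int entryCount, double bitsPerKey, long seed) {
        long bits = (long) (entryCount * bitsPerKey);
        this.data = new long[(int) ((bits + 63) / 64)];
        this.k = Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
        this.seed = seed;
    }

    // Stand-in 64-bit hash (assumption): MurmurHash3 finalizer over key + seed.
    static long hash64(long key, long seed) {
        long h = key + seed;
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // Maps a 32-bit hash to [0, n) without a modulo (multiply-shift reduction).
    static int reduce(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    void add(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            // Java masks the shift distance of a long to its low 6 bits.
            data[reduce(a, data.length)] |= 1L << a;
            a += b;
        }
    }

    boolean mayContain(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            if ((data[reduce(a, data.length)] & (1L << a)) == 0) {
                return false;
            }
            a += b;
        }
        return true;
    }
}
```

A filter for 4000 entries could then be created with something like `new SimpleBloom(4000, 25.7, 0L)`, where 25.7 bits per key is roughly what the question's sizing formula yields for fp = 232000. String keys would first need to be hashed to a long.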