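I built a Bloom filter using murmur3, blake2b, and the Kirsch-Mitzenmacher optimization, as described in the second answer to this question: Which hash functions to use in a Bloom filter
However, when I test it, the Bloom filter's error rate is consistently higher than I expect.
Here is the code I use to build the Bloom filter: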
import java.math.BigInteger;
import java.util.BitSet;
import java.util.stream.IntStream;

public class BloomFilter {
    // murmur(...) and blake256(...) are helper methods that are not shown here.
    private BitSet filter;
    private int size;
    private int hfNum;
    private int prime;
    private double fp = 232000; //One false positive every fp items

    public BloomFilter(int count) {
        size = (int)Math.ceil(Math.ceil(((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
        hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
        //size = (int)Math.ceil((hfNum * count) / Math.log(2.0));
        filter = new BitSet(size);
        System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
    }

    public BloomFilter extraSecure(int count) {
        return new BloomFilter(count, true);
    }

    private BloomFilter(int count, boolean x) {
        size = (int)Math.ceil((((double)-count) * Math.log(1/fp))/(Math.pow(Math.log(2),2)));
        hfNum = (int)Math.ceil(((this.size / count) * Math.log(2)));
        prime = findPrime();
        size = prime * hfNum;
        filter = new BitSet(prime * hfNum);
        System.out.println("Initialized filter with " + size + " positions and " + hfNum + " hash functions.");
    }

    public void add(String in) {
        filter.set(getMurmur(in), true);
        filter.set(getBlake(in), true);
        if(this.hfNum > 2) {
            for(int i = 3; i <= (hfNum); i++) {
                filter.set(getHash(in, i));
            }
        }
    }

    public boolean check(String in) {
        if(!filter.get(getMurmur(in)) || !filter.get(getBlake(in))) {
            return false;
        }
        for(int i = 3; i <= hfNum; i++) {
            if(!filter.get(getHash(in, i))) {
                return false;
            }
        }
        return true;
    }

    private int getMurmur(String in) {
        int temp = murmur(in) % (size);
        if(temp < 0) {
            temp = temp * -1;
        }
        return temp;
    }

    private int getBlake(String in) {
        int temp = new BigInteger(blake256(in), 16).intValue() % (size);
        if(temp < 0) {
            temp = temp * -1;
        }
        return temp;
    }

    private int getHash(String in, int i) {
        int temp = ((getMurmur(in)) + (i * getBlake(in))) % size;
        return temp;
    }

    private int findPrime() {
        int temp;
        int test = size;
        while((test * hfNum) > size ) {
            temp = test - 1;
            while(!isPrime(temp)) {
                temp--;
            }
            test = temp;
        }
        if((test * hfNum) < this.size) {
            test++;
            while(!isPrime(test)) {
                test++;
            }
        }
        return test;
    }

    private static boolean isPrime(int num) {
        if (num < 2) return false;
        if (num == 2) return true;
        if (num % 2 == 0) return false;
        for (int i = 3; i * i <= num; i += 2)
            if (num % i == 0) return false;
        return true;
    }

    @Override
    public String toString() {
        final StringBuilder buffer = new StringBuilder(size);
        IntStream.range(0, size).mapToObj(i -> filter.get(i) ? '1' : '0').forEach(buffer::append);
        return buffer.toString();
    }
}
Here is the code I use to test it:
public static void main(String[] args) throws Exception {
    int z = 0;
    int times = 10;
    while(z < times) {
        z++;
        System.out.print("\r");
        System.out.print(z);
        BloomFilter test = new BloomFilter(4000);
        SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
        for(int i = 0; i < 4000; i++) {
            test.add(blake256(Integer.toString(random.nextInt())));
        }
        int temp = 0;
        int count = 1;
        while(!test.check(blake512(Integer.toString(temp)))) {
            temp = random.nextInt();
            count++;
        }
        if(z == (times)) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count)).getBytes(), StandardOpenOption.APPEND);
        }else {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes(), StandardOpenOption.APPEND);
        }
        if(z == 1) {
            Files.write(Paths.get("counts.txt"), (Integer.toString(count) + ",").getBytes());
        }
    }
}
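I expected to get a value reasonably close to the fp variable in the BloomFilter class, but I often get about half of that. Does anyone know what I am doing wrong, or whether this is normal?
EDIT: To show what I mean by a high error rate: when I run the code on a filter initialized with count 4000 and fp 232000, these are the numbers of values the filter had to check before it hit a false positive: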
158852,354114,48563,76875,156033,82506,61294,2529,82008,32624
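These were generated with a filter initialized through the extraSecure() method, repeated 10 times to produce the 10 numbers. All but one of the runs needed fewer than 232000 generated values to find a false positive. The average of the 10 is about 105540, and that is typical no matter how many times I repeat the test.
Given that I added 4000 data points, the fact that one run found a false positive after generating only 2529 numbers is a big problem for me.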
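For reference, the sizing formulas and the expected spread of this kind of measurement can be checked with a short sketch. It assumes that each lookup of a non-member is an independent trial with false-positive probability p = 1/fp, so the number of lookups until the first false positive follows a geometric distribution. The class and method names are made up for illustration. Unlike `this.size / count` in the constructor, which is an int division, the sketch divides in double.

```java
final class BloomSizing {
    // Bits m for n items at target false-positive probability p: m = -n ln(p) / (ln 2)^2.
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions k for m bits and n items: k = (m / n) ln 2, rounded.
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Mean number of trials until the first success of a geometric distribution.
    static double meanTrials(double p) {
        return 1.0 / p;
    }

    // Median of the same distribution (continuous approximation): ln 2 / -ln(1 - p).
    static double medianTrials(double p) {
        return Math.log(2) / -Math.log1p(-p);
    }

    // Probability that the first success occurs within the first n trials: 1 - (1 - p)^n.
    static double probWithin(long n, double p) {
        return -Math.expm1(n * Math.log1p(-p));
    }

    public static void main(String[] args) {
        long n = 4000;
        double p = 1.0 / 232000;
        long m = optimalBits(n, p);
        System.out.println("bits m            = " + m);
        System.out.println("hash functions k  = " + optimalHashes(m, n));
        System.out.println("mean trials       = " + meanTrials(p));
        System.out.println("median trials     = " + medianTrials(p));
        System.out.println("P(first FP within 2529 trials) = " + probWithin(2529, p));
    }
}
```

Under that assumption, single runs scatter widely around 1/p and about half of all runs are expected to finish below roughly 0.69/p, so ten samples give only a noisy estimate of fp.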
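Answer 0 (score: 0)
It turns out the problem is that the answer on the other page is not entirely correct, and neither is the comment below it.
The comment says:
In the paper, hash_i = hash1 + i x hash2 % p, where p is a prime, hash1 and hash2 are in the range [0, p-1], and the bitset consists of k * p bits.
However, reading the paper shows that although every hash is taken mod p, each hash function is assigned its own subset of the total bitset. As I understand it, hash1 mod p determines a position among indices 0 to p, hash2 mod p determines a position among indices p to 2 * p, and so on, up to the k value chosen for the bitset.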
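One way to read that partitioned layout is sketched below. It is only an interpretation of the description above, not code taken from the paper or from the question: the class name, the method name, and the 0-based numbering of the hash functions are assumptions. The i-th derived hash is reduced mod p and then shifted into its own slice of p bits, so a bitset of k * p bits holds k non-overlapping slices.

```java
final class PartitionedIndex {
    // Bit index for the i-th hash function (0-based) in a bitset of k * p bits.
    // Slice i covers the half-open range [i * p, (i + 1) * p).
    static int index(int h1, int h2, int i, int p) {
        // The combination is done in long to avoid int overflow of i * h2;
        // floorMod keeps the result in [0, p - 1] even for negative inputs.
        long g = Math.floorMod(h1 + (long) i * h2, (long) p);
        return i * p + (int) g;
    }

    public static void main(String[] args) {
        int p = 7;   // a small prime, chosen only for illustration
        int k = 3;   // number of hash functions, so the bitset has k * p = 21 bits
        int h1 = 3;
        int h2 = 5;
        for (int i = 0; i < k; i++) {
            System.out.println("hash function " + i + " sets bit " + index(h1, h2, i, p));
        }
    }
}
```

With this layout two different hash functions can never set the same bit, which is the difference from the question's getHash, where every index is taken mod the full size.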
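I am not 100% sure that adding this will fix my code, but it is worth a try. I will update if it works.
Update: it did not help. I am looking into other possible causes of the problem.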
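Answer 1 (score: 0)
I am afraid I don't know where the bug is, but you can simplify a lot. You don't actually need the prime, SecureRandom, BigInteger, or the modulo. All you need is a good 64-bit hash (seeded if possible; for example, Murmur):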
// entryCount, bitsPerKey, seed and hash64(...) are assumed to be defined elsewhere.
long bits = (long) (entryCount * bitsPerKey);
int arraySize = (int) ((bits + 63) / 64);
long[] data = new long[arraySize];
int k = getBestK(bitsPerKey);

void add(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        // reduce(a, arraySize) picks the word; the low 6 bits of a pick the bit in it
        data[reduce(a, arraySize)] |= 1L << a;
        a += b;
    }
}

boolean mayContain(long key) {
    long hash = hash64(key, seed);
    int a = (int) (hash >>> 32);
    int b = (int) hash;
    for (int i = 0; i < k; i++) {
        if ((data[reduce(a, arraySize)] & 1L << a) == 0) {
            return false;
        }
        a += b;
    }
    return true;
}

static int reduce(int hash, int n) {
    // http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
    return (int) (((hash & 0xffffffffL) * n) >>> 32);
}

static int getBestK(double bitsPerKey) {
    return Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
}
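To try the approach end to end, the fragment can be wrapped in a small class together with a rough false-positive measurement. This is a sketch under stated assumptions: the class name SimpleBloom, the 26 bits per key, the seed, and the trial count are arbitrary, and hash64 here is a splitmix64-style mixer used as a stand-in for a seeded 64-bit hash such as Murmur, not the hash the answer refers to.

```java
import java.util.Random;

final class SimpleBloom {
    private final long[] data;
    private final int arraySize;
    private final int k;
    private final long seed;

    SimpleBloom(int entryCount, double bitsPerKey, long seed) {
        long bits = (long) (entryCount * bitsPerKey);
        this.arraySize = (int) ((bits + 63) / 64);
        this.data = new long[arraySize];
        this.k = Math.max(1, (int) Math.round(bitsPerKey * Math.log(2)));
        this.seed = seed;
    }

    // Stand-in hash (assumption): splitmix64-style finalizer applied to key + seed.
    static long hash64(long key, long seed) {
        long x = key + seed + 0x9e3779b97f4a7c15L;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    // Maps a 32-bit hash to [0, n) with a multiply and shift instead of a modulo.
    static int reduce(int hash, int n) {
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    void add(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            data[reduce(a, arraySize)] |= 1L << a;
            a += b;
        }
    }

    boolean mayContain(long key) {
        long hash = hash64(key, seed);
        int a = (int) (hash >>> 32);
        int b = (int) hash;
        for (int i = 0; i < k; i++) {
            if ((data[reduce(a, arraySize)] & 1L << a) == 0) {
                return false;
            }
            a += b;
        }
        return true;
    }

    // Adds keys 0..n-1, then probes keys >= n, which were never added.
    static double measureFalsePositiveRate(int n, double bitsPerKey, int trials) {
        SimpleBloom filter = new SimpleBloom(n, bitsPerKey, 42L);
        for (long key = 0; key < n; key++) {
            filter.add(key);
        }
        Random random = new Random(1);
        int falsePositives = 0;
        for (int i = 0; i < trials; i++) {
            long probe = (long) n + random.nextInt(Integer.MAX_VALUE);
            if (filter.mayContain(probe)) {
                falsePositives++;
            }
        }
        return (double) falsePositives / trials;
    }

    public static void main(String[] args) {
        double rate = measureFalsePositiveRate(4000, 26.0, 1_000_000);
        System.out.println("measured false-positive rate: " + rate);
    }
}
```

Measuring the rate over a fixed, large number of probes gives a steadier estimate than counting lookups until the first false positive, because a single first hit varies a lot from run to run.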