Expected range of universal hashing

Time: 2018-06-21 11:58:55

Tags: hash language-agnostic

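I'm sure this is a simple question, but I can't see the obvious solution... If I have a hash table with m bins and I hash n < m keys into it, what is the probability that no bin receives more than k keys? I'm trying to figure out how often I would have to rehash if I fill a table to load n / m and then rehash until no bin holds more than k colliding keys (obviously k > n / m).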

1 answer:

Answer 0 (score: 1)

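With a uniform distribution, this is the same as throwing balls into bins, which is studied in "Balls into Bins - A Simple and Tight Analysis" by M. Raab and A. Steger.

This is somewhat related to schemes in which each key can choose between several hash functions, but here you use only one hash function.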
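For small tables the probability can also be written down exactly. The following is a sketch based on the standard counting argument for occupancy problems, not a formula quoted from the paper; it assumes every key lands in each of the m bins independently and uniformly, and it uses the symbols n, m and k from the question. The notation [x^n] means "the coefficient of x^n".

```latex
% n keys, m bins, k = largest number of keys allowed in one bin
\Pr[\text{no bin holds more than } k \text{ keys}]
  = \frac{n!}{m^{n}} \, [x^{n}] \left( \sum_{j=0}^{k} \frac{x^{j}}{j!} \right)^{m}

% special case k = 1: no collision at all (the birthday problem)
\Pr[\text{no bin holds more than } 1 \text{ key}]
  = \prod_{i=0}^{n-1} \left( 1 - \frac{i}{m} \right)
```

As a check of the special case, n = 10 and m = 30 give 30 * 29 * ... * 21 / 30^10, which is about 0.1846, so a rehash would be needed with probability of about 0.8154.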

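Since this is stackoverflow.com, here is a simulation program that you can use to verify the formulas. According to it, the rehash probability also depends on the absolute number of balls and bins, not only on the average number of balls per bin.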

import java.util.Arrays;
import java.util.Random;

// The class name is chosen here only so that the snippet compiles as one file.
public class BallsIntoBins {

    public static void main(String... args) {
        for (int k = 1; k < 4; k++) {
            test(10, 30, k);
            test(100, 300, k);
        }
    }

    // Estimates the probability that at least one bin receives more than k balls.
    public static void test(int ballCount, int binCount, int k) {
        int rehashCount = 0;
        Random r = new Random(1);
        // keep the total number of hashed balls constant across configurations
        int testCount = 100000000 / ballCount;
        for (int test = 0; test < testCount; test++) {
            long[] balls = new long[ballCount];
            int[] bins = new int[binCount];
            for (int i = 0; i < ballCount; i++) {
                balls[i] = r.nextLong();
            }
            // it's very unlikely to get duplicates, but test
            Arrays.sort(balls);
            for (int i = 1; i < ballCount; i++) {
                if (balls[i - 1] == balls[i]) {
                    throw new AssertionError();
                }
            }
            // always 0 here: each trial uses fresh random keys instead of a new hash function
            int universalHashId = 0;
            boolean rehashNeeded = false;
            for (int i = 0; i < ballCount; i++) {
                long x = balls[i];
                // might as well do y = x
                long y = supplementalHashWeyl(x, universalHashId);
                int binId = reduce((int) y, binCount);
                // stop as soon as one bin exceeds the limit k
                if (++bins[binId] > k) {
                    rehashNeeded = true;
                    break;
                }
            }
            if (rehashNeeded) {
                rehashCount++;
            }
        }
        System.out.println("balls: " + ballCount + " bins: " + binCount +
                " k: " + k + " rehash probability: " + (double) rehashCount / testCount);
    }

    // Maps a 32-bit hash (treated as unsigned) to the range 0 .. n - 1 without a modulo.
    public static int reduce(int hash, int n) {
        // http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
        return (int) (((hash & 0xffffffffL) * n) >>> 32);
    }

    // Mixes the key with the index of the hash function and returns a 32-bit hash.
    public static int supplementalHashWeyl(long hash, long index) {
        long x = hash + (index * 0xbf58476d1ce4e5b9L);
        x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
        x = ((x >>> 32) ^ x);
        return (int) x;
    }
}

Output:

balls: 10 bins: 30 k: 1 rehash probability: 0.8153816
balls: 100 bins: 300 k: 1 rehash probability: 1.0
balls: 10 bins: 30 k: 2 rehash probability: 0.1098305
balls: 100 bins: 300 k: 2 rehash probability: 0.777381
balls: 10 bins: 30 k: 3 rehash probability: 0.0066018
balls: 100 bins: 300 k: 3 rehash probability: 0.107309
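The simulated values can be cross-checked with an exact calculation. The sketch below is not part of the original answer: the class and method names are made up for illustration, and it assumes that every key falls into each bin independently and uniformly. It evaluates n! * [x^n] (sum over c = 0..k of (x/m)^c / c!)^m by multiplying the polynomial out one bin at a time. If each rehash is assumed to behave like a fresh, independent hash function, the number of hash functions tried until one succeeds is geometrically distributed, so its expectation is 1 / p.

```java
// Hypothetical helper class, not from the original answer.
final class MaxLoadExact {

    /** Probability that none of m bins receives more than k of n uniformly thrown keys. */
    public static double probabilityNoBinExceeds(int n, int m, int k) {
        // term[c] = (1/m)^c / c!  : weight of putting exactly c keys into one bin
        double[] term = new double[k + 1];
        term[0] = 1.0;
        for (int c = 1; c <= k; c++) {
            term[c] = term[c - 1] / ((double) m * c);
        }
        // coeff[j] = coefficient of x^j in (sum_c term[c] * x^c)^b after b bins
        double[] coeff = new double[n + 1];
        coeff[0] = 1.0;
        for (int b = 0; b < m; b++) {
            double[] next = new double[n + 1];
            for (int j = 0; j <= n; j++) {
                double sum = 0.0;
                for (int c = 0; c <= Math.min(k, j); c++) {
                    sum += coeff[j - c] * term[c];
                }
                next[j] = sum;
            }
            coeff = next;
        }
        // multiply by n! to count the orderings of the labelled keys
        double p = coeff[n];
        for (int i = 2; i <= n; i++) {
            p *= i;
        }
        return p;
    }

    /** Expected number of hash functions tried, assuming independent attempts. */
    public static double expectedAttempts(int n, int m, int k) {
        return 1.0 / probabilityNoBinExceeds(n, m, k);
    }

    public static void main(String[] args) {
        int[][] configs = {{10, 30}, {100, 300}};
        for (int k = 1; k < 4; k++) {
            for (int[] cfg : configs) {
                double p = probabilityNoBinExceeds(cfg[0], cfg[1], k);
                System.out.println("balls: " + cfg[0] + " bins: " + cfg[1] + " k: " + k
                        + " exact rehash probability: " + (1.0 - p)
                        + " expected attempts: " + 1.0 / p);
            }
        }
    }
}
```

Plain double arithmetic is sufficient for table sizes like the ones simulated above; for much larger n and m the intermediate coefficients can underflow, and working with logarithms would be the safer choice.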