Question

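I have the following problem:

Given a set of unordered, arbitrarily large IDs (e.g. 1, 2, 5, 6, 8 in a 32-bit space),
calculate a hash code in a larger space (e.g. 64-bit).

The simple approach is to calculate a hash function for each individual ID and then XOR everything together. However, if the IDs live in a 32-bit space and the hash function in a 64-bit space, that might not be the best way to approach this (collisions and so on).

I've been thinking about using the Murmur3 finalizer and then XORing the results together, but I suspect that won't work either, for the same reason (I'm not sure, to be honest). Similarly, simply multiplying the values should also work (because a * b = b * a), but I'm not sure how 'good' that hash function will be.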
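For reference, the "finalize each ID, then XOR" idea can be sketched in C as follows. This is only an illustration of the approach being asked about, not a recommendation: the function name hash_ids_xor is made up here, and the +1 offset is an assumption added because fmix64(0) is 0, which would otherwise make ID 0 invisible in the XOR.

```c
#include <stddef.h>
#include <stdint.h>

/* MurmurHash3 64-bit finalizer (fmix64): a bijective bit mixer. */
static uint64_t fmix64(uint64_t k)
{
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

/* Order-independent hash: mix every ID on its own, then XOR the results.
 * hash_ids_xor is a hypothetical name; the +1 is an assumption so that
 * ID 0 does not contribute fmix64(0) == 0. */
uint64_t hash_ids_xor(const uint32_t *ids, size_t n)
{
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h ^= fmix64((uint64_t)ids[i] + 1);
    return h;
}
```

Because XOR is commutative, {1, 2, 5, 6, 8} and {8, 6, 5, 2, 1} hash to the same value. Note that an ID occurring twice would cancel itself out, so this only makes sense for true sets.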

Obviously, sorting the IDs comes to mind, after which Murmur3 will do nicely. Still, I don't want to sort if I can avoid it.
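A minimal sketch of the sort-first baseline, in C. The name hash_ids_sorted is hypothetical, and the multiply-and-add loop is only a stand-in for Murmur3 (it borrows the 64-bit FNV offset basis and prime as constants); any order-dependent hash could be substituted once the IDs are in canonical order.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sorts a private copy of the IDs, then runs an order-dependent hash over it.
 * Returns 0 and stores the hash in *out, or -1 if the allocation fails. */
int hash_ids_sorted(const uint32_t *ids, size_t n, uint64_t *out)
{
    uint32_t *tmp = malloc(n ? n * sizeof *tmp : 1);
    if (!tmp)
        return -1;
    if (n)
        memcpy(tmp, ids, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u32);

    uint64_t h = 0xcbf29ce484222325ULL;      /* placeholder seed */
    for (size_t i = 0; i < n; i++)
        h = h * 0x100000001b3ULL + tmp[i];   /* placeholder mixing step */

    free(tmp);
    *out = h;
    return 0;
}
```

The cost is the O(n log n) sort plus a temporary copy, which is exactly what the question hopes to avoid.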

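What's a good algorithm for such a hash function?

Update

OK, I guess I might have been a bit confusing.

The second answer on Why is XOR the default way to combine hashes? actually explains how to combine hash functions. In the case presented there, XOR is argued to be a bad hash function because "dab" produces the same code as "abd". In my case, I want these things to generate the same hash value, but I also want to minimize the chance that, say, "abc" also generates the same hash value as, say, "abd".

The whole purpose of most hash functions is that they have a high probability of using the complete key space if you feed them data. In general, these hash functions rely on the data arriving in a fixed order, and multiply by a large number to shuffle the bits around. So in simple terms: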

var hash = SomeInitialConstant;
foreach (var id in ids) {
  hash = hash * SomeConstant + hashCode(id);
}
// ... optionally shuffle bits around as finalizer
return hash;

Now, this works fine if the IDs are always in the same order. However, if the IDs are unordered, this won't work, because x * constant + y is not commutative.
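To make the non-commutativity concrete, here is a minimal C rendering of the loop above. The seed 17 and the multiplier 31 are arbitrary placeholders for SomeInitialConstant and SomeConstant, and hashCode(id) is taken to be the identity; none of these choices come from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Order-dependent combine: hash = hash * constant + id.
 * 17 and 31 are placeholder constants chosen for illustration. */
uint64_t hash_ordered(const uint32_t *ids, size_t n)
{
    uint64_t hash = 17;
    for (size_t i = 0; i < n; i++)
        hash = hash * 31 + ids[i];
    return hash;
}
```

With these constants, the sequence 1, 2 gives (17 * 31 + 1) * 31 + 2 = 16370, while 2, 1 gives (17 * 31 + 2) * 31 + 1 = 16400, so the same set hashes differently depending on iteration order.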

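If you square the IDs, I don't think you will end up using the whole hash space. Consider what happens with large numbers such as 100000, 100001, and so on. Their squares are 10000000000, 10000200001, and so on. You can never get a square that produces a number like 900000 (simply because sqrt(900000) is not an integer).

More generally, all the hash space between 10000000000 and 10000200001 is likely to be lost. The space between, say, 0 and 10, on the other hand, will see a lot of collisions, because the hash space available between the squares of small numbers is small as well.

The whole point of using a large key space is obviously to have almost no collisions. I'd like a fairly large hash space (say, 256 bits) to ensure that collisions are virtually non-existent in real-life scenarios.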

Answer 1

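I just checked:

using a 32-bit hash
in a hash table with 64K slots
holding 64K items (load factor = 100%)
each an array of 8-bit values (unsigned char)
(array sizes 4 ... 63)
hash function := cnt + (sum cube(arr[i]))
or := sum(square(zobrist[arr[i]]))
Zobrist works better (but the array needs to be randomised)
and the collisions are no more than expected for an optimal hash function.
To avoid recalculation (a space-time tradeoff) I actually store the hash value inside the objects.
Since collisions are a fact of life, you can postpone the sorting to the moment you actually need it for the final comparison (when the chain length starts to grow above 1).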

#include <stdio.h>
#include <stdlib.h>

struct list {
        struct list *next;
        unsigned hash;
        unsigned short cnt;
        unsigned char *data;
        };

struct list *hashtab[1<<16] = {NULL, };
#define COUNTOF(a) (sizeof a / sizeof a[0])
unsigned zobrist[256] = {0,};
/*************************/
unsigned hash_it(unsigned char *cp, unsigned cnt)
{
unsigned idx;
unsigned long long hash = 0;

for(idx=0; idx < cnt; idx++) {
#if 0   /* cube */
        hash += (cp[idx] * cp[idx] * cp[idx]);
#else
        unsigned val;
        val = zobrist[cp[idx]];
        hash += (val * val);
#endif
        }
#if 0   /* as a tie-breaker: add the count (this avoids pythagorean triplets but *not* taxi-numbers) */
hash += cnt;
#endif
return hash;
}
/*************************/
struct list *list_new(unsigned cnt){
struct list *p;
unsigned idx;

p = malloc( sizeof *p + cnt);
p->data = (unsigned char*)(p+1);
p->cnt = cnt;
p->next = NULL;

for(idx=0; idx < cnt; idx++) {
        p->data[idx] = 0xff & rand();
        }
p->hash = hash_it(p->data, p->cnt);
return p;
}
/*************************/
void do_insert(struct list *this)
{
struct list **pp;
unsigned slot;

slot  = this->hash % COUNTOF(hashtab);
for (pp = &hashtab[slot]; *pp; pp = &(*pp)->next) {;}
*pp = this;
}
/*************************/
void list_print(struct list *this)
{
unsigned idx;
if (!this) return;

printf("%lx data[%u] = ", (unsigned long) this->hash, this->cnt);

for (idx=0; idx < this->cnt; idx++) {
        printf("%c%u"
        , idx ? ',' : '{' , (unsigned int) this->data[idx] );
        }
printf("}\n" );
}
/*************************/
unsigned list_cnt(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) { cnt++; }
return cnt;
}
/*************************/
unsigned list_cnt_collisions(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) {
        struct list *that;
        for(that=this->next; that; that=that->next) {
                if (that->cnt != this->cnt) continue;
                if (that->hash == this->hash) cnt++;
                }
        }
return cnt;
}
/*************************/
int main(void)
{
unsigned idx, val;
struct list *p;
unsigned hist[300] = {0,};

        /* NOTE: you need a better_than_default random generator
        ** , the zobrist array should **not** contain any duplicates
        */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] = val;
        }

        /* a second pass will increase the randomness ... just a bit ... */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] ^= val;
        }
        /* load-factor = 100 % */
for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        do {
          val = random();
          val %= 0x40;
        } while(val < 4); /* array size 4..63 */
        p = list_new(val);
        do_insert(p);
        }

for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        val = list_cnt( hashtab[idx]);
        hist[val] += 1;
        val = list_cnt_collisions(hashtab[idx]);
        if (!val) continue;
        printf("[%u] : %u\n", idx, val);
        for (val=0,p = hashtab[idx]; p; p= p->next) {
                printf("[%u]: ", val++);
                list_print(p);
                }
        }

for (idx = 0; idx < COUNTOF(hist); idx++) {
        if (!hist[idx]) continue;
        printf("[%u] = %u\n", idx, hist[idx]);
        }

return 0;
}
/*************************/

Output histogram (chain length, 0 := empty slot):

$ ./a.out
[0] = 24192
[1] = 23972
[2] = 12043
[3] = 4107
[4] = 1001
[5] = 181
[6] = 34
[7] = 4
[8] = 2
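The program above caches the hash in each object but does not show the deferred "final comparison" step. A minimal sketch of what that could look like follows; struct idset and idset_equal are hypothetical names that mirror the fields of struct list, and the sketch assumes the cached hash is order-independent, so sorting the data in place does not invalidate it.

```c
#include <stdlib.h>
#include <string.h>

struct idset {
    unsigned hash;          /* cached order-independent hash */
    unsigned short cnt;     /* number of elements */
    unsigned char *data;    /* the elements, in arbitrary order */
};

static int cmp_uchar(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* Returns 1 if both objects hold the same values, 0 otherwise.
 * The cheap checks run first; the data is sorted (in place) only
 * when hash and size already match. */
int idset_equal(struct idset *a, struct idset *b)
{
    if (a->hash != b->hash)
        return 0;           /* different hash: cannot be equal */
    if (a->cnt != b->cnt)
        return 0;           /* different size: cannot be equal */
    qsort(a->data, a->cnt, sizeof *a->data, cmp_uchar);
    qsort(b->data, b->cnt, sizeof *b->data, cmp_uchar);
    return memcmp(a->data, b->data, a->cnt) == 0;
}
```

In a lookup, this would be called only for entries in the same chain, so most candidates are rejected by the hash comparison without any sorting.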

Final note: instead of summing the squares of the Zobrist[] entries, you could just as well XOR them together (assuming the entries are unique).
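A sketch of that XOR variant in C. The table is filled here from a xorshift32 generator purely as a stand-in for a good random source (an assumption, not what the program above does); with a nonzero seed its first 256 outputs are distinct and nonzero, which satisfies the uniqueness requirement. The names zobrist_init and hash_zobrist_xor are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t zobrist[256];

/* Fill the table from a xorshift32 generator (a stand-in RNG).
 * The seed must be nonzero. */
void zobrist_init(uint32_t seed)
{
    uint32_t s = seed;
    for (size_t i = 0; i < 256; i++) {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        zobrist[i] = s;
    }
}

/* Order-independent hash: XOR of the table entries of all elements. */
uint32_t hash_zobrist_xor(const unsigned char *data, size_t n)
{
    uint32_t h = 0;
    for (size_t i = 0; i < n; i++)
        h ^= zobrist[data[i]];
    return h;
}
```

One caveat of XOR compared to the sum of squares: a value that occurs twice cancels itself out, so this is only suitable when the input contains no duplicates.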

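Extra final note: the C stdlib rand() function may be unusable for this. RAND_MAX could be only 15 bits: 0x7fff (32767). To fill the zobrist table you need more bits per value. This can be done by XORing some extra (rand() << shift) terms into the higher bits.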
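One way to do that, as a minimal sketch: the helper name rand32 is hypothetical, and the code assumes only what the C standard guarantees, namely that rand() delivers at least 15 bits.

```c
#include <stdint.h>
#include <stdlib.h>

/* Builds a 32-bit value from three rand() calls, using only the
 * low 15 bits of each call. */
uint32_t rand32(void)
{
    uint32_t r = (uint32_t)rand() & 0x7fff;
    r ^= ((uint32_t)rand() & 0x7fff) << 15;
    r ^= ((uint32_t)rand() & 0x7fff) << 30;  /* only 2 of these bits fit */
    return r;
}
```

This widens the output but does not make a weak generator any better; for a Zobrist table a higher-quality generator is still preferable where available.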

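New results, using (a sample from) a very large source domain (32 elements * 8 bits), hashing it to 32-bit hash keys, and inserting into a hash table of 1<<20 slots.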

Number of elements 1048576 number of slots 1048576
Element size = 8bits, Min setsize=0, max set size=32
(using Cubes, plus adding size) Histogram of chain lengths:
[0] = 386124 (0.36824)
[1] = 385263 (0.36742)
[2] = 192884 (0.18395)
[3] = 64340 (0.06136)
[4] = 16058 (0.01531)
[5] = 3245 (0.00309)
[6] = 575 (0.00055)
[7] = 78 (0.00007)
[8] = 9 (0.00001)

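This appears to be very close to optimal; for a 100% loaded hash table, the first two entries in the histogram should be equal, and in the perfect case both are 1/e. The first two entries are the empty slots and the slots with exactly one element.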
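The 1/e figure follows from the usual Poisson approximation for throwing n items uniformly at random into n slots. That approximation is a standard result rather than something stated in this answer, but it gives a quick worked check against the measured fractions:

```latex
\begin{align*}
P(\text{chain length} = k) &\approx \frac{e^{-\lambda}\,\lambda^{k}}{k!},
  \qquad \lambda = \frac{\text{items}}{\text{slots}} = 1 \\
P(0) = P(1) &= \tfrac{1}{e} \approx 0.36788 \\
P(2) &= \tfrac{1}{2e} \approx 0.18394 \\
P(3) &= \tfrac{1}{6e} \approx 0.06131 \\
P(4) &= \tfrac{1}{24e} \approx 0.01533
\end{align*}
```

The measured values 0.36824, 0.36742, 0.18395, 0.06136 and 0.01531 agree with these to about three decimal places.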

Answer 2

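In my case, I want these things to generate the same hash value, but I also want to minimize the chance that, say, "abc" also generates the same hash value as, say, "abd".

Bitwise XOR actually guarantees that: if two sets of the same size are identical except for one element, then they will necessarily have different bitwise XORs. (Incidentally, the same is true of summation with wraparound: if two sets of the same size are identical except for one element, then they will necessarily have different wraparound sums.)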
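The reasoning behind both guarantees is short enough to write out. The notation below (a common part C, differing elements a and b) is introduced here for illustration:

```latex
\begin{align*}
&\text{Let } A = C \cup \{a\},\quad B = C \cup \{b\},\quad a \neq b,\quad a, b \notin C. \\
&\Bigl(\bigoplus_{x \in A} x\Bigr) \oplus \Bigl(\bigoplus_{x \in B} x\Bigr)
  = a \oplus b \neq 0
  \quad\Longrightarrow\quad
  \bigoplus_{x \in A} x \neq \bigoplus_{x \in B} x \\
&\sum_{x \in A} x - \sum_{x \in B} x \equiv a - b \not\equiv 0 \pmod{2^{32}}
  \quad \text{since } 0 \le a, b < 2^{32} \text{ and } a \neq b
\end{align*}
```

In both cases the shared elements cancel, leaving only the two elements that differ.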

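So if you use bitwise XOR for the bottom 32 bits, you essentially have 32 "extra" bits to try to reduce collisions further: to reduce the cases where two sets of different sizes have the same checksum, or where two sets that differ by two or more elements have the same checksum. A relatively simple approach is to pick a function f that maps 32-bit integers to 32-bit integers, and then apply bitwise XOR to the results of applying f to each element. The main things you'd want from f:

It should be cheap and easy to implement.
It should map zero to a nonzero value (so that {1,2,3} and {0,1,2,3} have different checksums).
The mapping should not consist of rearranging bits in a fixed way (e.g. a bit shift), since reorganize_bits(a) XOR reorganize_bits(b) is equivalent to reorganize_bits(a XOR b), so it adds no independent information to the checksum.
For the same reason, the mapping should not consist of XORing with a constant.

Above, joop suggested f(a) = a² MOD 2³², which seems decent to me, except for the issue with zero. So maybe f(a) = (a + 1)² MOD 2³²?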
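Put together, the scheme described in this answer could be sketched in C like this. The function name checksum64 and the layout (plain XOR in the low half, XOR of f in the high half) are one reading of the answer, not code taken from it.

```c
#include <stddef.h>
#include <stdint.h>

/* f(a) = (a + 1)^2 mod 2^32; unsigned 32-bit arithmetic wraps,
 * so no explicit modulo is needed. */
static uint32_t f(uint32_t a)
{
    uint32_t t = a + 1u;
    return t * t;
}

/* Low 32 bits: XOR of the raw IDs. High 32 bits: XOR of f(id). */
uint64_t checksum64(const uint32_t *ids, size_t n)
{
    uint32_t lo = 0, hi = 0;
    for (size_t i = 0; i < n; i++) {
        lo ^= ids[i];
        hi ^= f(ids[i]);
    }
    return ((uint64_t)hi << 32) | lo;
}
```

For {1,2,3} the low half is 1 ^ 2 ^ 3 = 0 and the high half is 4 ^ 9 ^ 16 = 29. Adding the element 0 leaves the low half at 0 but changes the high half to 29 ^ 1 = 28, so the two sets are told apart by the extra bits.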

Answer 3

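This answer is just for completeness.

From the solution by @joop, I noticed that he uses fewer bits than I do. He also suggested using x^3 instead of x^2, which makes a huge difference.

In my code, I use 8-bit IDs for testing, because that results in a small key space. This means we can simply test all chains of up to 4 or 5 IDs in length. The hash space is 32 bits. The (C#) code is very simple: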

static void Main(string[] args)
{
    for (int index = 0; index < 256; ++index)
    {
        CreateHashChain(index, 4, 0);
    }

    // Create collision histogram:
    Dictionary<int, int> histogram = new Dictionary<int, int>();
    foreach (var item in collisions)
    {
        int val;
        histogram.TryGetValue(item.Value, out val);
        histogram[item.Value] = val + 1;
    }

    foreach (var item in histogram.OrderBy((a) => a.Key))
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }
    Console.ReadLine();
}

private static void CreateHashChain(int index, int size, uint code)
{
    uint current = (uint)index;

    // hash
    uint v = current * current;
    code = code ^ v;

    // recurse for the rest of the chain:
    if (size == 1)
    {
        int val;
        collisions.TryGetValue(code, out val);
        collisions[code] = val + 1;
    }
    else
    {
        for (int i = index + 1; i < 256 - size; ++i)
        {
            CreateHashChain(i, size - 1, code);
        }
    }
}

private static Dictionary<uint, int> collisions = new Dictionary<uint, int>();

Now, this is all about the hash function. I'll write down some of the things I tried:

x^2

Code:

// hash
uint v = current * current;
code = code ^ v;

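Results: lots and lots and lots of collisions. In fact, there is not a single case that collides fewer than 3612 times. Obviously we are only using 16 bits, so that explains it quite well. Either way, the results are pretty bad.

x^3

Code: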

// hash
uint v = current * current * current;
code = code ^ v;

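Results:

Still pretty bad, but then again, we are only using 24 bits of the key space, so collisions are bound to happen. Still, it is much better than using x^2.

x^4

Code: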

// hash
uint v = current * current;
v = v * v;
code = code ^ v;

Results:

1: 118795055
2: 20402127
3: 2740658
4: 329621
5: 38453
6: 4420
7: 495
8: 47
9: 12

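As expected, this is much better, and obviously that is because we are now using the full 32 bits.

Introducing y

Another way to get a larger key space is to introduce another variable, say y, which is a function of x. The idea behind this is that x^n for small values of x results in small numbers, and therefore a high chance of collisions; we can offset this by making sure y is large when x is small, and using bit arithmetic to combine the two hash functions. The easiest way to do that is to flip all the bits: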

// hash
uint x = current;
uint y = (255 ^ current);

uint v1 = (UInt16)(x * x * x);
uint v2 = (UInt16)(y * y * y);
code = code ^ v1 ^ (v2 << 16);

This gives the following results:

1: 154971022
2: 6827322
3: 235081
4: 7554
5: 263
6: 9
7: 1

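Interestingly, this immediately gives much better results than all the previous approaches. It also immediately raises the question of whether the 16-bit cast makes any sense. After all, x^3 results in a 24-bit space, with large gaps for small values of x. Combining that with another, shifted 24-bit space makes better use of the 32 bits available. Note that we should still shift by 16 (and not by 8!) for the same reason. Without the cast, the results become: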

1: 162671251
2: 3276751
3: 45277
4: 473
5: 5
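The code for this variant is not shown in the answer. Assuming it is the previous snippet with the two UInt16 casts removed and the shift by 16 kept, one accumulation step could be sketched as follows (written in C rather than C#; mix_id is a hypothetical name):

```c
#include <stdint.h>

/* One step of the XOR accumulation for an 8-bit id, assuming the
 * variant without the 16-bit casts and with the shift by 16 kept. */
uint32_t mix_id(uint32_t code, uint8_t id)
{
    uint32_t x = id;
    uint32_t y = 255u ^ id;          /* large when x is small */
    uint32_t v1 = x * x * x;         /* at most 24 bits */
    uint32_t v2 = y * y * y;         /* at most 24 bits */
    return code ^ v1 ^ (v2 << 16);   /* top bits of v2 fall off the word */
}
```

For id 0, v1 is 0 and v2 is 255^3 = 0xFD02FF, so the step XORs 0x02FF0000 into the code; the upper byte of v2 is lost to the shift, which is the price of keeping v1 and v2 mostly apart.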

Multiplying by a constant (final result)

Another way to blow up the key space of y is by multiplying and adding. The code now becomes:

uint x = current;
uint y = (255 ^ current);
y = (y + 7577) * 0x85ebca6b;

uint v1 = (x * x * x);
uint v2 = (y * y * y);
code = code ^ v1 ^ (v2 << 8);

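While this might not seem like an improvement, it has the advantage that we can use this trick to easily extend 8-bit sequences to arbitrary n-bit sequences. I shift by 8 because I don't want the bits of v1 to interfere too much with the bits of v2. This gives the following results: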

1: 162668435
2: 3277904
3: 45459
4: 464
5: 5

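This is actually pretty good! Considering all possible chains of 4 IDs, we have only a 2% chance of a collision. Also, if we have larger chains, we can add more bits using the same trick we used for v2 (adding 8 bits for each additional hash code, so a 256-bit hash should be able to accommodate chains of roughly 29 8-bit IDs).

The only remaining question is: how do we test this? As @joop pointed out with his program, the math is actually quite complex; for larger numbers of bits and larger chains, random sampling might actually prove to be a solution.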

Good hash function for an unordered set of IDs
