Question

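Simply put, I have a dictionary of words that I am adding to a hash table.

I am using double hashing (not the traditional approach), and the following is what has produced the best results so far.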

    public static int getHashKey(String word) {

        int index = 0;

        for(int i = 0; i<word.length(); i++){

            index += Math.pow(4,  i)*((int)word.charAt(i));
            index = index % size;
        }
        return index;
    }

    public static int getDoubleHashKey(String word) {

        int jump = 1;

        for(int i = 0; i<word.length(); i++){

            jump = jump * word.charAt(i);
            jump = jump % size;
        }
        return jump;

    }

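This gives me 127,000 collisions. My hash table size is also fixed at a prime about twice the number of words, and I cannot change it.

Is there any way to improve the double hashing algorithm? (Either of the two methods above.)

I know it depends on what we are storing in the hash table and so on, but is there an intuitive approach, or some generally applicable tips, that would help me avoid more collisions?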

Answer 1

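I ran a Scala program on a dictionary of about 336,531 entries. Version 2 produces significantly fewer collisions (118,142) than version 1 (305,431). Note that version 2 is close to the optimal number of collisions: 118,142 + 216,555 = 334,697, so 334,697 / 336,531 = 99.46% of the values in the range 0-216555 are used. Taking the modulo outside the loop, rather than inside it, improves your hash method.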

    import scala.io.Source

    object Hash extends App {
        val size = 216555

        // Version 1: reduce modulo size after every multiplication (as in the question).
        def doubleHashKey1(word: String) = {
            var jump = 1
            for (ch <- word) {
                jump = jump * ch
                jump = jump % size
            }
            jump
        }

        // Version 2: multiply all characters first (Int overflow wraps), reduce once at the end.
        def doubleHashKey2(word: String) = {
            var jump = 1
            for (ch <- word) jump = jump * ch
            jump % size
        }

        // Collisions = number of words minus number of distinct hash values.
        def countCollisions(words: Set[String], hashFun: String => Int) =
            words.size - words.map(hashFun).size

        def readDictionary(path: String) = Source.fromFile(path).getLines().toSet

        val dict = readDictionary("words.txt")
        println(countCollisions(dict, doubleHashKey1))
        println(countCollisions(dict, doubleHashKey2))
    }
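
Since the question is in Java, here is a minimal sketch of the same collision measure (number of words minus number of distinct hash values). The class name `CollisionCount` and the toy hash in `main` are made up for illustration; plug in `getHashKey` or `getDoubleHashKey` to measure your own functions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

public class CollisionCount {
    // Collisions = number of distinct words minus number of distinct hash values,
    // mirroring countCollisions in the Scala program.
    static int countCollisions(Set<String> words, ToIntFunction<String> hashFun) {
        Set<Integer> hashes = new HashSet<>();
        for (String w : words) {
            hashes.add(hashFun.applyAsInt(w));
        }
        return words.size() - hashes.size();
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(List.of("a", "b", "cc"));
        // Toy hash (illustration only): the word length, so "a" and "b" collide.
        System.out.println(countCollisions(words, String::length));
    }
}
```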

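To handle integer overflow, the modulus has to be computed in a different (but easy to implement) way so that it returns a positive value. Another test would be to check whether the collisions are statistically evenly distributed.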
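
One way to get a non-negative modulus in Java is `Math.floorMod`, which returns a value in `[0, size)` even when the overflowed product is negative. The sketch below is only an illustration: the names `SafeHash`, `hashKey` and `stepSize` are hypothetical, and the table size is passed in as a parameter. It also shifts the step into `[1, size - 1]`, since a step of 0 would never advance the probe in double hashing.

```java
public class SafeHash {
    // Multiply all characters (int overflow wraps around), then reduce once.
    // Math.floorMod keeps the result in [0, size) even if the product is negative.
    static int hashKey(String word, int size) {
        int h = 1;
        for (int i = 0; i < word.length(); i++) {
            h *= word.charAt(i);
        }
        return Math.floorMod(h, size);
    }

    // Step for double hashing, shifted into [1, size - 1] so it is never 0.
    static int stepSize(String word, int size) {
        return 1 + hashKey(word, size - 1);
    }

    public static void main(String[] args) {
        int size = 216555; // table size used in the Scala program above
        System.out.println(hashKey("dictionary", size));
        System.out.println(stepSize("dictionary", size));
    }
}
```

This keeps the multiplicative scheme from the answer as-is and only fixes the sign of the result; whether it spreads your particular dictionary well is something to measure.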

Double hashing efficiency with a dictionary of words
