Question

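Recently I attended an interview and was asked a good question about hash collisions.

Question: Given a list of strings, print the anagrams together.

Example:

i/p: {act, god, animal, dog, cat}

o/p: act, cat, dog, god

I want to create a hashmap with the word as the key and the list of its anagrams as the value.

To avoid collisions, I want to generate a unique hash code for anagrams instead of sorting each word and using the sorted word as the key.

I am looking for a hash algorithm that handles collisions by some means other than chaining. I want the algorithm to generate the same hash code for both act and cat, so that the next word is added to the same value list.

Can anyone suggest a good algorithm?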

Answer 1

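Hashing with the sorted string is pretty nice, and I probably would have done that, but it can indeed be slow and cumbersome. Here is another thought, and I am not sure it works: pick a set of prime numbers, as small as you like and the same size as your character set, and build a fast mapping function from your characters to those primes. Then, for a given word, map each character to its prime and multiply the primes together. Finally, hash using the result.

This is very similar to what Heuster suggested, only with fewer collisions (in fact, I believe there will be no false collisions, given the uniqueness of the prime factorization of any number, as long as the product does not overflow).

A simple example: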

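// Sketch only: first_char, get_next_word and hash are assumed to be defined elsewhere.
int primes[] = {2, 3, 5, 7, ...}; // one prime per character; can be generated with simple code

inline int prime_map(char c) {
    // check that c is within the legal character set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
unsigned long long key = 1; // the product grows quickly; a plain int overflows after a few letters
while (*ptr != '\0') {      // compare against the string terminator, not NULL
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word);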

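[Edit]

A few words about uniqueness: any integer has exactly one decomposition into a product of primes, so given an integer key in the hash you can reconstruct all the strings that could hash to it, and only those strings. Just factor the key into primes, p1^n1 * p2^n2 * ..., and convert each prime to its matching char. The char for p1 appears n1 times, and so on. You cannot end up with a prime you did not explicitly use, because being prime means it cannot be produced by multiplying other primes.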
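The reconstruction described above can be sketched in Python. This is only an illustration under stated assumptions: the alphabet is lowercase a-z, the i-th letter is mapped to the i-th prime, and the names `PRIMES` and `letters_from_key` are made up for this example.

```python
# Assumed mapping: 'a' -> 2, 'b' -> 3, 'c' -> 5, ... (i-th letter -> i-th prime).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def letters_from_key(key):
    """Recover the (sorted) letters whose primes multiply to `key`."""
    if key < 1:
        raise ValueError("key must be a positive integer")
    letters = []
    for i, p in enumerate(PRIMES):
        # Each time p divides the key, the matching letter occurs once more.
        while key % p == 0:
            letters.append(chr(ord("a") + i))
            key //= p
    if key != 1:
        raise ValueError("key has a prime factor outside the alphabet")
    return "".join(letters)

recovered = letters_from_key(2 * 5 * 71)  # primes for 'a', 'c', 't'
```

Every anagram of the recovered letters hashes to the same key, which is why the key alone identifies the whole anagram class, provided the product was never truncated by overflow.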

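This suggests another possible improvement: if you can reconstruct the string, you only need to mark which permutations you saw while populating the hash. Since the permutations can be ordered lexicographically, you can replace each one with a number. This saves the space of storing the actual strings in the hash, but it requires more computation, so it is not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)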
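One way to replace a permutation with a number is to compute its rank among the distinct permutations of its letters in lexicographic order. The following Python sketch shows that idea; the function names are hypothetical and this is just one possible encoding, not necessarily the one the answer had in mind.

```python
from collections import Counter
from math import factorial

def count_arrangements(counts):
    """Number of distinct permutations of a multiset given its letter counts."""
    total = factorial(sum(counts.values()))
    for c in counts.values():
        total //= factorial(c)
    return total

def permutation_rank(word):
    """0-based rank of `word` among the sorted distinct permutations of its letters."""
    counts = Counter(word)
    rank = 0
    for ch in word:
        for smaller in sorted(counts):
            if smaller >= ch:
                break
            # Count the permutations that put a smaller letter at this position.
            counts[smaller] -= 1
            rank += count_arrangements(counts)
            counts[smaller] += 1
        # Consume the current letter and move on to the next position.
        counts[ch] -= 1
        if counts[ch] == 0:
            del counts[ch]
    return rank

ranks = [permutation_rank(w) for w in ["act", "cat", "tca"]]
```

With this encoding, a bucket could hold the prime-product key plus a set of small integers instead of the strings themselves, at the cost of the extra ranking work on every insert.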

Answer 2

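Hash function: assign a prime number to each character. While computing the hash code, take the prime assigned to each character and multiply it into the running value. All anagrams then produce the same hash value.

Example: a - 2, c - 3, t - 7

Hash value of cat = 3 * 2 * 7 = 42. Hash value of act = 2 * 3 * 7 = 42. Print all strings that have the same hash value (anagrams will have the same hash value).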
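A minimal Python sketch of this grouping, assuming lowercase a-z input. The mapping used here (i-th letter to i-th prime) is an assumption for illustration and differs from the a/c/t values in the example above; the names `PRIMES`, `prime_product` and `group_anagrams` are made up.

```python
from collections import defaultdict

# Assumed mapping: 'a' -> 2, 'b' -> 3, 'c' -> 5, ... (i-th letter -> i-th prime).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_product(word):
    """Multiply together the primes assigned to each letter of `word`."""
    key = 1
    for ch in word:
        key *= PRIMES[ord(ch) - ord("a")]
    return key

def group_anagrams(words):
    """Group words that share the same prime product, in first-seen order."""
    groups = defaultdict(list)
    for w in words:
        groups[prime_product(w)].append(w)
    return list(groups.values())

result = group_anagrams(["act", "god", "animal", "dog", "cat"])
```

Python integers have arbitrary precision, so the product cannot overflow here; in a language with fixed-width integers the key would need a wider type or a modulus.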

Answer 3

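Other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that will not overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On Mac OS X, this file is located at /usr/share/dict/words.)

My word_hash function takes the ordinal value of each character mod 32. This ensures that uppercase and lowercase letters get the same code. The large prime I use is 2^58 - 27. Any large prime will do, as long as it is less than 2^64 / A, where A is my alphabet size. I am using 32 as my alphabet size, so I cannot use a number larger than about 2^59 - 1. Since Ruby uses one bit for the sign and a second bit to indicate whether the value is a number or an object, I lose a bit compared with other languages.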

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."

Answer 4

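A small practical optimization I would suggest for the hash method above:

Assign the smallest primes to the vowels, then to the most frequently occurring consonants. For example: e: 2, a: 3, i: 5, o: 7, u: 11, t: 13, and so on.

Also, the average word length in English is about 6.

Also, the first 26 primes are all at most 101 [2, 3, 5, 7, ..., 101].

Hence, for a word of average length, your hash produces a value of at most around 100^6 = 10^12.

So collisions become very unlikely if you use a prime larger than 10^12 as the modulus.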
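Here is a short Python sketch of that assignment. The consonant order in `LETTER_ORDER` is only an assumed frequency ranking chosen for illustration, and the modulus is the prime 2^58 - 27 mentioned in Answer 3, which is comfortably larger than 10^12; the names are hypothetical.

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

# Vowels first, then consonants in an assumed most-to-least-frequent order.
LETTER_ORDER = "eaiou" + "tnshrdlcmwfgypbvkjxqz"
PRIME_FOR = dict(zip(LETTER_ORDER, PRIMES))

MODULUS = (1 << 58) - 27  # a prime larger than 10^12

def anagram_hash(word):
    """Product of the letters' primes, reduced modulo MODULUS at every step."""
    h = 1
    for ch in word.lower():
        h = (h * PRIME_FOR[ch]) % MODULUS
    return h

same = anagram_hash("listen") == anagram_hash("silent")
```

Giving frequent letters the small primes keeps typical products small, so fewer words ever reach the modulus and wrap around.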

Answer 5

The complexity above seems very misplaced! You don't need prime numbers or hashes. It takes just three simple operations:

1. Map each OriginalWord to a tuple of (SortedWord, OriginalWord). Example: "cat" becomes ("act", "cat"); "dog" becomes ("dgo", "dog"). This is a simple sort on the characters of each OriginalWord.
2. Sort the tuples by their first element. Example: ("dgo", "dog"), ("act", "cat") sorts to ("act", "cat"), ("dgo", "dog"). This is a simple sort on the entire collection.
3. Iterate through the tuples (in order), emitting the OriginalWord. Example: ("act", "cat"), ("dgo", "dog") emits "cat" "dog". This is a simple iteration.

Two iterations and two sorts are all it takes!

In Scala, it is exactly one line of code:

val words = List("act", "animal", "dog", "cat", "elvis", "lead", "deal", "lives", "flea", "silent", "leaf", "listen")

words.map(w => (w.toList.sorted.mkString, w)).sorted.map(_._2)
// Returns: List(animal, act, cat, deal, lead, flea, leaf, dog, listen, silent, elvis, lives)

Or, as the original question implies, if you only want the cases where the count is > 1, it takes just a bit more:

scala> words.map(w => (w.toList.sorted.mkString, w)).groupBy(_._1).filter({case (k,v) => v.size > 1}).mapValues(_.map(_._2)).values.toList.sortBy(_.head)
res64: List[List[String]] = List(List(act, cat), List(elvis, lives), List(flea, leaf), List(lead, deal), List(silent, listen))

Answer 6

The solution using the product of primes is brilliant, and here is a Java implementation in case anyone needs one.

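import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class HashUtility {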
    private int n;
    private Map<Character, Integer> primeMap;

    public HashUtility(int n) {
        this.n = n;
        this.primeMap = new HashMap<>();
        constructPrimeMap();
    }

    /**
     * Utility to check if the passed {@code number} is a prime.
     *
     * @param number The number which is checked to be prime.
     * @return {@link boolean} value representing the prime nature of the number.
     */
    private boolean isPrime(int number) {
        if (number <= 2)
            return number == 2;
        else
            return (number % 2) != 0
                    &&
                    IntStream.rangeClosed(3, (int) Math.sqrt(number))
                            .filter(n -> n % 2 != 0)
                            .noneMatch(n -> (number % n == 0));
    }

    /**
     * Maps all first {@code n} primes to the letters of the given language.
     */
    private void constructPrimeMap() {
        List<Integer> primes = IntStream.range(2, Integer.MAX_VALUE)
                .filter(this::isPrime)
                .limit(this.n)      //Limit the number of primes here
                .boxed()
                .collect(Collectors.toList());

        int curAlphabet = 0;
        for (int i : primes) {
            this.primeMap.put((char) ('a' + curAlphabet++), i);
        }
    }

    /**
     * We calculate the hashcode of a word by calculating the product of each character mapping prime. This works since
     * the product of 2 primes is unique from the products of any other primes.
     * <p>
     * Since the hashcode can be huge, we return it modulo a large prime.
     *
     * @param word The {@link String} to be hashed.
     * @return {@link int} representing the prime hashcode associated with the {@code word}
     */
    public int hashCode(String word) {
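        long primeProduct = 1;
        long mod = 100000007;
        for (char currentCharacter : word.toCharArray()) {
            // Reduce after every multiplication so the running product never overflows a long.
            primeProduct = (primeProduct * this.primeMap.get(currentCharacter)) % mod;
        }

        return (int) primeProduct;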
    }
}

Please let me know if and how I can improve this.

Generate the same unique hash code for all anagrams
