Question

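I need a specialised hash function h(X, Y) in Java with the following properties.

1. X and Y are strings.
2. h(X, Y) = h(Y, X).
3. X and Y are arbitrary-length strings, and there is no length limit on the result of h(X, Y) either.
4. h(X, Y) and h(Y, X) should not collide with h(A, B) = h(B, A).
5. h() does not need to be a secure hash function unless that is necessary to meet the requirements above.
6. Fairly high performance, but this is an open-ended criterion.

To my mind, requirements 2 and 4 are slightly contradictory, but perhaps I am worrying too much.

At the moment, this is what I am doing in Java: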

import java.math.BigInteger;

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO;
    BigInteger bB = BigInteger.ZERO;
    // Weight the code at position i by 127^(i+1) and sum the terms for each string.
    for(int i=0; i<str1.length(); i++) {
        bA = bA.add(BigInteger.valueOf(127L).pow(i+1).multiply(BigInteger.valueOf(str1.codePointAt(i))));
    }
    for(int i=0; i<str2.length(); i++) {
        bB = bB.add(BigInteger.valueOf(127L).pow(i+1).multiply(BigInteger.valueOf(str2.codePointAt(i))));
    }
    // Multiplication is commutative, so h(X, Y) = h(Y, X).
    return bA.multiply(bB);
}

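I think this is terrible, which is why I am looking for a better solution. Thanks.

I forgot to mention that on a 2.53GHz dual-core Macbook Pro with 8GB RAM, running Java 1.6 on OS X 10.7, the hash function takes about 270 microseconds for two 8-character (ASCII) strings. I suspect this will be higher as the string size increases, or if Unicode characters are used.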

Answer 1

Why not just add their hashCodes together?
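A minimal sketch of that suggestion, assuming a plain 32-bit sum is acceptable (the class name `SumHash` and method name `h` are made up for illustration). It is symmetric because integer addition is commutative, but it inherits every collision of `String.hashCode()`, so it cannot satisfy requirement 4 strictly:

```java
class SumHash {
    // Symmetric by construction: int addition is commutative (overflow simply wraps around).
    static int h(String x, String y) {
        return x.hashCode() + y.hashCode();
    }

    public static void main(String[] args) {
        // Swapping the arguments gives the same value.
        System.out.println(h("foo", "bar") == h("bar", "foo"));
        // "Aa" and "BB" share the hashCode 2112, so these two different pairs collide.
        System.out.println(h("Aa", "x") == h("BB", "x"));
    }
}
```

Both lines print `true`: the first shows the symmetry, the second shows a collision between two different pairs.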

Answer 2

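How strict are you about requirement 4? If the answer is "not completely strict", then you could simply concatenate the two strings, putting the smaller one first (this would make h('A', 'B') and h('AB', '') collide).

If there is some character that you are sure will never appear in the string values, you could use a single instance of it as a delimiter, which would fix the collision above.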
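The idea above could be sketched roughly as follows. The separator `'\u0000'` is only an assumption (it stands for "some character that never occurs in the inputs"), and the names `OrderedConcatHash` and `h` are hypothetical. The result is a string as long as the inputs, which the question allows since it places no limit on the output length:

```java
class OrderedConcatHash {
    // Assumed separator: a character that is guaranteed never to appear in either input.
    private static final char SEPARATOR = '\u0000';

    static String h(String x, String y) {
        // Put the lexicographically smaller string first so that argument order does not matter.
        if (x.compareTo(y) <= 0) {
            return x + SEPARATOR + y;
        }
        return y + SEPARATOR + x;
    }

    public static void main(String[] args) {
        System.out.println(h("A", "B").equals(h("B", "A")));   // same pair, swapped
        System.out.println(h("A", "B").equals(h("AB", "")));   // different pairs
    }
}
```

With the separator in place, the swapped pair compares equal while ('A', 'B') and ('AB', '') no longer produce the same result.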

Answer 3

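3) h(X, Y) and h(Y, X) should not collide with h(A, B) = h(B, A) if X is not equal to A and Y is not equal to B.

I think this requirement rules out any hash function that produces numbers which are (on average) smaller than the original strings.

Any requirement of no collisions runs into the roadblock of the Pigeonhole Principle.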
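To make the pigeonhole argument concrete, here is a rough count. It assumes an alphabet of k symbols, inputs of exactly n characters each, and reads the requirement as "distinct unordered pairs must map to distinct outputs"; the symbols k, n, N and b are introduced here only for illustration.

```latex
% N = number of unordered pairs {X, Y} of strings of length n over k symbols
N = \binom{k^n}{2} + k^n = \frac{k^n\,(k^n + 1)}{2}

% A collision-free output of b bits needs at least N distinct values
2^b \ge N
\quad\Longrightarrow\quad
b \ge \log_2 \frac{k^n\,(k^n + 1)}{2} > 2n \log_2 k - 1
```

Under these assumptions the output needs nearly as many bits as the two inputs combined, so any shorter output must collide somewhere.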

Answer 4

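From point 4 it follows that h(x,"") must never collide with h(y,"") unless x.equals(y) is true. So there can be no size limit on what h(x,y) produces, because it has to produce a unique result for each unique x. But there are infinitely many unique strings. I don't think this is a proper hash function.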

Answer 5

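Today I decided to add my solution to this hash function problem. It has not been tested very well and I have not measured its performance, so please give me feedback in the comments. My solution is below: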

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public abstract class HashUtil {
    // the hash is hash_size integers wide (32*32 bits)
    private static final int hash_size = 32;

    // constants that can be changed to reduce collisions
    private static final BigInteger INITIAL_HASH = BigInteger.valueOf(7);
    private static final BigInteger HASH_MULTIPLIER = BigInteger.valueOf(31);
    private static final BigInteger HASH_DIVIDER = BigInteger.valueOf(2).pow(32*hash_size);

    public static BigInteger computeHash(String arg){
        BigInteger hash = INITIAL_HASH;
        // process the string in blocks of hash_size characters
        for (int i=0;i<arg.length()/hash_size+1;i++){
            int[] tmp = new int[hash_size];
            // copy the codes of this block into tmp, one slot per character
            for(int j=0;j<Math.min(arg.length()-hash_size*i,hash_size);j++){
                tmp[j]=arg.codePointAt(i*hash_size+j);
            }
            // mix the block into the hash; mod keeps the result within 32*hash_size bits
            hash = hash.multiply(HASH_MULTIPLIER).add(new BigInteger(convert(tmp)).abs()).mod(HASH_DIVIDER);
        }
        return hash;
    }

    public static BigInteger computeHash(String arg1,String arg2){
        // addition is commutative; mod reduces the sum back into the result space
        return computeHash(arg1).add(computeHash(arg2)).mod(HASH_DIVIDER);
    }

    // packs the ints into a big-endian byte array
    private static byte[] convert(int[] arg){
        ByteBuffer byteBuffer = ByteBuffer.allocate(arg.length*4);
        IntBuffer intBuffer = byteBuffer.asIntBuffer();
        intBuffer.put(arg);
        return byteBuffer.array();
    }

    public static void main(String[] args){
        String firstString="dslkjfaklsjdkfajsldfjaldsjflaksjdfklajsdlfjaslfj",secondString="unejrng43hti9uhg9rhe3gh9rugh3u94htfeiuwho894rhgfu";
        System.out.println(computeHash(firstString,secondString).equals(computeHash(secondString,firstString)));
    }

}

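I think my solution should not produce any collisions for a single string shorter than 32 characters (more precisely, for a single string shorter than the value of the hash_size variable). Also, it is not very easy to find collisions (I think). To tune the collision probability for your particular task, you can try other prime numbers instead of 7 and 31 in the INITIAL_HASH and HASH_MULTIPLIER variables. What do you think about it? Is it good enough for you?

P.S. I think it will be better if you try much bigger prime numbers.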

Answer 6

This builds on String#hashCode, which is not a perfect hash function, so it does not satisfy condition 4.

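public static long hashStringConcatenation(String str1, String str2) {
    int h1 = str1.hashCode();
    int h2 = str2.hashCode();

    // Order the two hashes so that swapping the arguments gives the same result,
    // then pack them into one long: smaller hash in the high 32 bits, larger in the low 32 bits.
    if ( h1 < h2 )
    {
        return (((long) h1) << 32) | (h2 & 0xFFFFFFFFL);
    }
    else
    {
        return (((long) h2) << 32) | (h1 & 0xFFFFFFFFL);
    }
}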

Answer 7

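OK, @gkuzmin's comment made me think about why I am using powers of 127. So here is a slightly simpler version of the code. The changes are as follows:

I no longer use powers of 127; instead I concatenate the codePointAt numbers as strings, convert the result into a BigInteger for each input string, and then add the two BigIntegers.
To compact the answer, I take the final result mod 2^1024.

The speed is no better (perhaps a little worse!), but then I think the way I am measuring the speed is not right, because it probably also measures the time taken by the function call.

Here is the modified code. Does this satisfy all the conditions, except 4 in the unfortunate cases where a repetition occurs within the 2^1024 result space?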

public static BigInteger hashStringConcatenation(String str1, String str2) {
    if(str1==null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
        return null;
    }
    BigInteger bA, bB;
    String codeA = "", codeB = "";
    for(int i=0; i<str1.length(); i++) {
        codeA += str1.codePointAt(i);
    }
    for(int i=0; i<str2.length(); i++) {
        codeB += str2.codePointAt(i);
    }
    bA = new BigInteger(codeA);
    bB = new BigInteger(codeB);
    return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
}

Answer 8

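I decided to add another answer because @Anirban Basu has proposed another solution. I do not know how to link to his post, so if someone knows how to do that, please correct me.

Anirban's solution looks like this: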

public static BigInteger hashStringConcatenation(String str1, String str2) {
    if(str1==null || str1.isEmpty() || str2 == null || str2.isEmpty()) {
        return null;
    }
    BigInteger bA, bB;
    String codeA = "", codeB = "";
    for(int i=0; i<str1.length(); i++) {
        codeA += str1.codePointAt(i);
    }
    for(int i=0; i<str2.length(); i++) {
        codeB += str2.codePointAt(i);
    }
    bA = new BigInteger(codeA);
    bB = new BigInteger(codeB);
    return bA.add(bB).mod(BigInteger.valueOf(2).pow(1024));
}

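Your new solution now looks like a hash function, but it still has some problems. I suggest you think about the following:

Might it be better to throw IllegalArgumentException or NullPointerException when null is passed as an argument? Are you sure you do not want to compute a hash for empty strings?
To concatenate a large number of strings, it is better to use StringBuffer instead of the + operator. Using this class will have a big positive impact on the performance of your code.
Your hash function is not very secure: it is really easy to construct strings that produce a collision.

You can use the following code to demonstrate a collision in your hash function.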

public static void main(String[] args){
    String firstString=new StringBuffer().append((char)11).append((char)111).toString();
    String secondString=new StringBuffer().append((char)111).append((char)11).toString();

    BigInteger hash1 = hashStringConcatenation(firstString,"arbitrary_string");
    BigInteger hash2 = hashStringConcatenation(secondString,"arbitrary_string");
    System.out.println("Is hash equal: "+hash1.equals(hash2));
    System.out.println("Conflicted values: {"+firstString+"},{"+secondString+"}");
}

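So it is really easy to break your hash function. Moreover, it is good that it has a 2^1024 result space, but in practice your implementation has many collisions between strings that are very similar and very simple.

P.S. I think you should read about existing hash algorithms and about hash functions that failed in real life (such as the Java String class hash function, which in the past computed the hash using only the first 16 characters), and check your solution against your requirements and against real-life data. At the very least you can try to find a hash collision by hand; if you succeed, your solution most likely already has a problem.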

Answer 9

Following @gkuzmin's suggestions, here is my changed code:

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
    StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
    for(int i=0; i<str1.length(); i++) {
        codeA.append(str1.codePointAt(i));
    }
    for(int i=0; i<str2.length(); i++) {
        codeB.append(str2.codePointAt(i));
    }
    bA = new BigInteger(codeA.toString());
    bB = new BigInteger(codeB.toString());
    return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
}

Note that in the result I now multiply bA by bB instead of adding them.

I also added the test function suggested by @gkuzmin:

public static void breakTest2() {
    String firstString=new StringBuffer().append((char)11).append((char)111).toString();
    String secondString=new StringBuffer().append((char)111).append((char)11).toString();
    BigInteger hash1 = hashStringConcatenation(firstString,"arbitrary_string");
    BigInteger hash2 = hashStringConcatenation(secondString,"arbitrary_string");
    System.out.println("Is hash equal: "+hash1.equals(hash2));
    System.out.println("Conflicted values: {"+firstString+"},{"+secondString+"}");
}

And another test with strings that contain only numeric values:

public static void breakTest1() {
    Hashtable<String,String> seenTable = new Hashtable<String,String>();
    for (int i=0; i<100; i++) {
        for(int j=i+1; j<100; j++) {
            String hash = hashStringConcatenation(""+i, ""+j).toString();
            if(seenTable.contains(hash)) {
                System.out.println("Duplication for " + seenTable.get(hash) + " with " + i + "-" + j);
            }
            else {
                seenTable.put(hash, i+"-"+j);
            }
        }
    }
}

The code runs. Of course, it is not an exhaustive check, but the breakTest1() function reports no problems. @gkuzmin's function displays the following:

Is hash equal: true
Conflicted values: {                    o},{o                         }

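Why do the two strings produce the same hash? Because in both cases the function effectively works with the strings '11111' and 'arbitrary_string': the codes 11 and 111 concatenate to the same digits in either order. That is a problem.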
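One detail worth checking in breakTest1() above: `Hashtable.contains(Object)` tests whether a value is present, not a key, so a lookup by hash there can never match. The following is a minimal sketch, not the author's code, that keys a `HashMap` on the hash instead. The class name `ProductHashCheck` and the helper `firstCollision` are hypothetical, the product-based hash is re-declared only to keep the block self-contained, and Java 8 or later is assumed for `putIfAbsent`.

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

class ProductHashCheck {
    // Re-declaration of the product-based hash from this answer (StringBuilder instead of StringBuffer).
    static BigInteger hashStringConcatenation(String str1, String str2) {
        StringBuilder codeA = new StringBuilder(), codeB = new StringBuilder();
        for (int i = 0; i < str1.length(); i++) codeA.append(str1.codePointAt(i));
        for (int i = 0; i < str2.length(); i++) codeB.append(str2.codePointAt(i));
        return new BigInteger(codeA.toString())
                .multiply(new BigInteger(codeB.toString()))
                .mod(BigInteger.valueOf(2).pow(1024));
    }

    // Returns a description of the first colliding pair of numeric strings below limit, or null if none.
    static String firstCollision(int limit) {
        Map<BigInteger, String> seen = new HashMap<>();
        for (int i = 0; i < limit; i++) {
            for (int j = i + 1; j < limit; j++) {
                BigInteger hash = hashStringConcatenation("" + i, "" + j);
                // putIfAbsent returns the earlier pair if this hash was already recorded.
                String previous = seen.putIfAbsent(hash, i + "-" + j);
                if (previous != null) return previous + " with " + i + "-" + j;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "2" -> 50, "30" -> 5148, "4" -> 52, "12" -> 4950, and 50 * 5148 == 52 * 4950 == 257400.
        System.out.println(hashStringConcatenation("2", "30").equals(hashStringConcatenation("4", "12")));
        System.out.println(firstCollision(100));
    }
}
```

The first line prints `true`, so the pairs ("2", "30") and ("4", "12") collide under the product-based hash, and a key-based search over the same range as breakTest1() does find a duplicate.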

Answer 10

How about this slightly modified function?

public static BigInteger hashStringConcatenation(String str1, String str2) {
    BigInteger bA = BigInteger.ZERO, bB = BigInteger.ZERO;
    StringBuffer codeA = new StringBuffer(), codeB = new StringBuffer();
    for(int i=0; i<str1.length(); i++) {
        codeA.append(str1.codePointAt(i)).append("0");
    }
    for(int i=0; i<str2.length(); i++) {
        codeB.append(str2.codePointAt(i)).append("0");
    }
    bA = new BigInteger(codeA.toString());
    bB = new BigInteger(codeB.toString());
    return bA.multiply(bB).mod(BigInteger.valueOf(2).pow(1024));
}

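Here we append a delimiter "0" after each character code, so the combination of characters 11 111 and 111 11 no longer confuses the function, because the concatenation produces 1101110 and 1110110 respectively. It also still does not break requirement 2 of the original question.

So does this now solve the problem, albeit within the limits of the 2^1024 range?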
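One way to probe that question is to look for two inputs whose delimited digit strings coincide. The sketch below is only an illustration under that assumption (the class name `SeparatorHashCheck` is hypothetical and the function is re-declared to keep the block self-contained). Because "0" is itself a decimal digit, the single character with code 101 and two characters with code 1 both encode to 1010:

```java
import java.math.BigInteger;

class SeparatorHashCheck {
    // Re-declaration of the "0"-delimited hash from this answer (StringBuilder instead of StringBuffer).
    static BigInteger hashStringConcatenation(String str1, String str2) {
        StringBuilder codeA = new StringBuilder(), codeB = new StringBuilder();
        for (int i = 0; i < str1.length(); i++) codeA.append(str1.codePointAt(i)).append("0");
        for (int i = 0; i < str2.length(); i++) codeB.append(str2.codePointAt(i)).append("0");
        return new BigInteger(codeA.toString())
                .multiply(new BigInteger(codeB.toString()))
                .mod(BigInteger.valueOf(2).pow(1024));
    }

    public static void main(String[] args) {
        String first = "e";                          // code 101 -> "101" + "0" = "1010"
        String second = new String(new char[]{1, 1}); // codes 1, 1 -> "10" + "10" = "1010"
        BigInteger hash1 = hashStringConcatenation(first, "arbitrary_string");
        BigInteger hash2 = hashStringConcatenation(second, "arbitrary_string");
        System.out.println("Is hash equal: " + hash1.equals(hash2));
    }
}
```

This prints `Is hash equal: true`, which suggests the delimiter would need to be something that cannot appear inside a decimal code (or the codes would need a fixed width) for the encoding to be unambiguous.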

A particular kind of hash on string concatenation