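C++: How do I count the number of collisions when using a hash function?

Time: 2017-04-09 15:52:05

Tags: c++ hash

I was assigned this lab, in which I need to create a hash function and count the number of collisions that occur when hashing a file of up to 30,000 elements. Here is my code so far: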

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

long hashcode(string s){
  long seed = 31; 
  long hash = 0;
  for(int i = 0; i < s.length(); i++){
    hash = (hash * seed) + s[i];
  }
  return hash % 10007;
};

int main(int argc, char* argv[]){
  int count = 0;
  int collisions = 0;
  fstream input(argv[1]);
  string x;
  int array[30000];

  //File stream
  while(!input.eof()){
    input>>x;
    array[count] = hashcode(x);
    count++;
    for(int i = 0; i<count; i++){
        if(array[i]==hashcode(x)){
            collisions++;
        }
    }
  }
  cout<<"Total Input is " <<count-1<<endl;
  cout<<"Collision # is "<<collisions<<endl;
}

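I'm just not sure how to count the collisions. I tried storing each hash value in an array and then searching that array, but with only 10,000 elements it reports 12,000 collisions. Any advice on how to count collisions, or on how my hash function could be improved, would be appreciated. Thank you.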

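3 answers:

Answer 0 (score: 3)

The problem is that you are recounting collisions. Imagine your list contains four identical elements and nothing else, then walk through your algorithm to see how many collisions it counts.

Instead, create a set of hash codes, and each time you compute a hash code, check whether it is already in the set. If it is, increment the collision total. If it is not, add it to the set.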
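To make the overcounting concrete, here is a small sketch rather than the asker's exact program. It works on a list of precomputed hash values; `count_like_question` mimics the nested loop from the question and `count_with_set` applies the set-based approach just described. Both function names are made up for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Mimics the loop in the question: every new hash is compared against all
// hashes stored so far, including the copy that was just stored.
int count_like_question(const std::vector<long>& hashes) {
    std::vector<long> stored;
    int collisions = 0;
    for (long h : hashes) {
        stored.push_back(h);
        for (std::size_t i = 0; i < stored.size(); ++i) {
            if (stored[i] == h) {
                ++collisions;  // also fires for the element itself
            }
        }
    }
    return collisions;
}

// Set-based count: a hash is a collision only if it was seen before.
int count_with_set(const std::vector<long>& hashes) {
    std::set<long> seen;
    int collisions = 0;
    for (long h : hashes) {
        // insert() returns a pair; its bool member is false when the
        // value was already in the set.
        if (!seen.insert(h).second) {
            ++collisions;
        }
    }
    return collisions;
}
```

For four identical hash values such as `{5, 5, 5, 5}`, the first function returns 10 (1 + 2 + 3 + 4), while the second returns 3, the number of elements that landed on a hash already in use.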

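Edit

As a quick patch to your algorithm, I did the following: increment count after the loop, and break out of the for loop as soon as a collision is found. This is still not very efficient, because we loop over all the earlier results (using a set data structure would be faster), but it should at least be correct.

I also adjusted it so that hashcode(x) is not recomputed on every comparison.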


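Answer 1 (score: 1)

Adding an answer for educational purposes. This may well be your professor's next lesson.

Almost certainly, the most efficient way to detect hash collisions is to use a hash set (a.k.a. unordered_set).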

#include <iostream>
#include <unordered_set>
#include <fstream>
#include <string>

// your hash algorithm
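long hashcode(std::string const &s) {
    // Unsigned arithmetic wraps on overflow instead of being undefined,
    // and keeps the result of % non-negative.
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (std::size_t i = 0; i < s.length(); i++) {
        hash = (hash * seed) + static_cast<unsigned char>(s[i]);
    }
    return static_cast<long>(hash % 10007);
}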

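int main(int argc, char **argv) {
    // Require the input file name on the command line.
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <input-file>" << std::endl;
        return 1;
    }
    std::ifstream is{argv[1]};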
    std::unordered_set<long> seen_before;
    seen_before.reserve(10007);
    std::string buffer;
    int collisions = 0, count = 0;
    while (is >> buffer) {
        ++count;
        auto hash = hashcode(buffer);
        auto i = seen_before.find(hash);
        if (i == seen_before.end()) {
            seen_before.emplace_hint(i, hash);
        }
        else {
            ++collisions;
        }
    }
    std::cout << "Total Input is " << count << std::endl;
    std::cout << "Collision # is " << collisions << std::endl;
}
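As a rough sanity check on the number such a program prints, one can estimate how many collisions to expect. The sketch below assumes an idealised hash that spreads inputs uniformly over the table, which is an assumption and not a measured property of hashcode(); `expected_collisions` is a name made up for this illustration.

```cpp
#include <cmath>

// Expected number of collisions when n items are hashed uniformly at random
// into m buckets, where an item counts as a collision if its bucket is
// already occupied. Uniform hashing is an idealisation.
double expected_collisions(double n, double m) {
    // Each bucket stays empty with probability (1 - 1/m)^n, so the expected
    // number of occupied buckets is m * (1 - (1 - 1/m)^n).
    const double expected_occupied = m * (1.0 - std::pow(1.0 - 1.0 / m, n));
    return n - expected_occupied;
}
```

For 10,000 inputs and 10,007 buckets this gives roughly 3,700, and the true count can never exceed the number of inputs, so a result of 12,000 points to a counting bug. For 30,000 inputs there must be at least 30000 - 10007 = 19993 collisions regardless of the hash function, because there are only 10,007 possible hash values. Repeated words in the input also count as collisions under this definition.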

Answer 2 (score: 0)

For an explanation of hash tables, see How does a hash table work?

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

// Generate a hash code that is in the range of our hash table.
// The range we are using is 0 to 10,006 (a table of 10,007 slots), and
// the prime table size reduces the probability of collisions from
// different strings hashing to the same value.
unsigned long hashcode(const string& s){
    unsigned long seed = 31;
    unsigned long hash = 0;
    for (size_t i = 0; i < s.length(); i++){
        hash = (hash * seed) + s[i];
    }
    // we want to generate a hash code that is the size of our table.
    // so we mod the calculated hash to ensure that it is in the proper range
    // of our hash table entries. 10007 is a prime number which provides
    // better characteristics than a non-prime number table size.
    return hash % 10007;
}

int main(int argc, char * argv[]){
    int count = 0;
    int collisions = 0;
    fstream input(argv[1]);
    string x;
    int array[10007] = { 0 };   // one counter per possible hash value (0 to 10006)

    //File stream
    // Use the extraction itself as the loop condition, so a failed read at
    // end of file is not counted as an extra string.
    while (input >> x){
        count++;        // count the number of strings hashed.
        // hash the string and use the hash as an index into our hash table.
        // the hash table is only used to keep a count of how many times a particular
        // hash has been generated. So the table entries are ints that start with zero.
        // If the value is greater than zero then we have a collision.
        // So we use postfix increment to check the existing value while incrementing
        // the hash table entry.
        if ((array[hashcode(x)]++) > 0)
            collisions++;
    }
    cout << "Total Input is " << count << endl;
    cout << "Collision # is " << collisions << endl;

    return 0;
}