Question

I am building an open-addressing hash table in C++. It consists of an array of:

struct KeyValue {
    K key;
    V value;
};

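The key type K has two special values: empty and tombstone. The first marks a slot as free; the second marks a slot that was used but later deleted (this is necessary for probing).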
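To make the role of the two sentinels concrete, here is a minimal sketch of a linear-probing lookup. It assumes K = int, and the names `kEmpty`, `kTombstone` and `find_slot` are hypothetical, not part of the table described above. The point it illustrates is that an empty slot ends the probe sequence, while a tombstone does not (it is only remembered as a candidate for reuse).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sentinels for K = int; real keys must never equal them.
constexpr int kEmpty = -1;
constexpr int kTombstone = -2;

struct KeyValue {
    int key = kEmpty;
    double value = 0.0;
};

// Linear probing. Returns the slot holding `key`; otherwise the first
// reusable slot (tombstone or empty) seen on the probe path; or -1 if
// the probe path contains no reusable slot.
int find_slot(const std::vector<KeyValue>& slots, int key) {
    const std::size_t n = slots.size();
    if (n == 0) return -1;
    int reusable = -1;
    std::size_t i = std::hash<int>{}(key) % n;
    for (std::size_t probes = 0; probes < n; ++probes, i = (i + 1) % n) {
        const int k = slots[i].key;
        if (k == key) return static_cast<int>(i);
        if (k == kTombstone) {
            // Remember the first deleted slot, but keep probing:
            // the key may still live further along the path.
            if (reusable < 0) reusable = static_cast<int>(i);
            continue;
        }
        if (k == kEmpty) {
            // An empty slot proves the key is absent.
            return reusable >= 0 ? reusable : static_cast<int>(i);
        }
    }
    return reusable;
}
```

If tombstones ended the search the way empty slots do, a key inserted after a collision would become unreachable as soon as an earlier slot on its probe path was deleted.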

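The main challenge is designing an efficient API for this structure. I want to minimize the number of times a key is hashed and a slot is looked up.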

So far I have the following API, which I find unsafe:

// Return the slot index if the key is in the table
// or a slot index where I can construct the KeyValue
// if the key is not here (or -1 if there is no slot
// available and the insertion of such a key would
// need to grow the hash table)
int search(const K& key);

// Tells if the slot is empty (or if i == -1)
bool empty(int i);

// Construct a KeyValue in the HashTable in the slot i
// which has been found by search. The i might be changed
// if the table needs to grow.
void insert(const K& key, const V& value, int& i);

// Accessors for a slot i which is occupied
const V& value(int i);

Note that the table also has the classic functions, such as

void insert(const K& key, const V& value);

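which computes the hash, searches for a slot, and inserts the pair into the table. But I want to focus on the interface that lets a programmer use the table very efficiently.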

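For example, here is a function that computes f(key) and stores it if it has never been computed, or returns the value from the HashTable if it has already been computed.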

const V& compute(const K& key, HashTable<K, V>& table) {
    int i = table.search(key);
    if (table.empty(i)) {
        table.insert(key, f(key), i);
    }
    return table.value(i);
}

I am not entirely keen on this HashTable interface, because the method insert(const K&, const V&, int&) feels really unsafe to me.

Do you have any suggestions for a better API?

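PS: Chandler Carruth's talk "Efficiency with Algorithms, Performance with Data Structures", especially after 23:50, is very good for understanding the problems with std::unordered_map.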

Answer 1

I think you should try a very fast hash function.

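Check out https://github.com/Cyan4973/xxHash. I quote its description: "xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical on all platforms (little / big endian)."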

See also this thread from another question on this site: Fast Cross-Platform C/C++ Hashing Library. FNV, Jenkins and MurmurHash are known to be fast.
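As an illustration of how small such a hash can be, here is a sketch of 64-bit FNV-1a, one of the functions mentioned above. The function name `fnv1a_64` is my own; the offset basis and prime are the published FNV constants. This is only meant to show the shape of the algorithm, not to suggest it beats xxHash.

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
constexpr std::uint64_t fnv1a_64(std::string_view bytes) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV 64-bit prime
    }
    return h;
}
```

Whether a faster hash helps in practice depends on the key type and on how much of the lookup cost is hashing rather than cache misses during probing, so it is worth measuring before switching.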

Take a look at this post, where I gave the same answer as here and where there are other answers as well: Are there faster hash functions for unordered_map/set in C++?

Answer 2

You can make a get_or_insert function template that accepts an arbitrary functor instead of a value. You can then call it with a lambda:

template <class K, class V>
class HashTable {
private:
    int search(const K& key);
    bool empty(int i);
    void insert(const K& key, const V& value, int& i);
    const V& value(int i);

public:
    template <class F>
    const V& get_or_insert(const K& key, F&& f) {
        int i = search(key);
        if (empty(i)) {
            insert(key, f(), i);
        }
        return value(i);
    }
};

double expensive_computation(int key);

void foo() {
    HashTable<int, double> ht;
    int key = 42;
    double value = ht.get_or_insert(key, [key]{ return expensive_computation(key); });
}

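If get_or_insert is inlined and you don't need to capture a lot, this should be as efficient as the code you showed. If in doubt, compare the generated code using Godbolt's Compiler Explorer or a similar tool. (And if it is not inlined, it is still fine unless you have to capture a lot of different variables. That assumes you capture sensibly, i.e. capture things by reference if they are expensive to copy.)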

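Note: the "standard" way of passing functors in C++ seems to be by value, but I think passing by reference makes more sense. If everything is inlined, it should make no difference (and it made none in the examples I checked with GCC, Clang and MSVC), and if the call to get_or_insert is not inlined, you really don't want to copy the functor if it captures more than one or two small, trivial variables.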

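The only downside of using a universal reference that I can imagine arises when you have a functor that mutates its state in operator(). With such a functor, at least in the examples I can think of, I would want the original functor to be the one that is mutated. So IMO that is not really a downside.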
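A small sketch of that difference, using hypothetical stand-ins (`CountingFactory`, `invoke_by_value`, `invoke_by_forwarding_ref`) rather than the hash table itself: when the functor is taken by value, the state change happens on a copy; when it is taken by universal reference, the caller's object is the one that changes.

```cpp
#include <utility>

// A functor that mutates its own state in operator().
struct CountingFactory {
    int calls = 0;
    double operator()() {
        ++calls;
        return 1.5;
    }
};

// Takes the functor by value: operator() runs on a copy.
template <class F>
double invoke_by_value(F f) {
    return f();
}

// Takes the functor by universal (forwarding) reference:
// operator() runs on the caller's object.
template <class F>
double invoke_by_forwarding_ref(F&& f) {
    return std::forward<F>(f)();
}
```

After calling `invoke_by_value(factory)` the caller's `factory.calls` is still 0, while after `invoke_by_forwarding_ref(factory)` it is 1.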

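Alternatively, a modified version of the above is suitable if creating/assigning/destroying the value is expensive (e.g. std::string): call the functor with a mutable reference to the value in the slot. The functor can then assign/change the value directly in the hash table, with no need to construct and destroy a temporary.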
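A rough sketch of that variant follows. To keep it self-contained it uses a stand-in table of my own (`TinyTable`, with `get_or_init`) that has fixed capacity, no deletion and no growth, and that marks free slots with std::optional instead of the sentinel keys from the question. All of these are assumptions made for illustration; only the idea of handing the functor a `V&` comes from the paragraph above.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in table: fixed capacity, linear probing, no deletion or growth.
template <class K, class V>
class TinyTable {
public:
    explicit TinyTable(std::size_t capacity) : slots_(capacity) {}

    // Hashes and probes once. If `key` is absent, a default V is created
    // in the slot and init(V&) fills it in place. Returns a pointer to the
    // stored value, or nullptr if the table is full.
    template <class F>
    V* get_or_init(const K& key, F&& init) {
        const std::size_t n = slots_.size();
        std::size_t i = n ? std::hash<K>{}(key) % n : 0;
        for (std::size_t probes = 0; probes < n; ++probes, i = (i + 1) % n) {
            auto& slot = slots_[i];
            if (!slot) {
                slot.emplace(key, V{});               // construct in the slot
                std::forward<F>(init)(slot->second);  // functor writes directly
                return &slot->second;
            }
            if (slot->first == key) return &slot->second;
        }
        return nullptr;
    }

private:
    std::vector<std::optional<std::pair<K, V>>> slots_;
};
```

A call might look like `table.get_or_init(42, [](std::string& s) { s = "computed"; })`. The functor only runs when the key was absent, and the caller never sees a slot index, which avoids the stale-index problem of the insert(const K&, const V&, int&) overload.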

Efficient C++ API for an open-addressing hash table
