Question

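I'm working on a hash table for a class, and I'm trying to improve the insertion speed. My implementation uses chaining: a vector holds lists of strings. I have to insert more than 350,000 words from the dictionary (the "words" file at /usr/share/dict/words).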

Here is my hash table. Any odd naming conventions (e.g. MyDS) are required by the assignment:

#ifndef _MYDS_H
#define _MYDS_H

#include "MyHash.h"
#include <string>
#include <vector>
#include <list>
#include <iostream>

using namespace std;

class MyDS
{
public:
    MyDS()
    {
        max_size = 128;
        size = 0;
        nodes.resize(max_size);
    }

// destructor

// copy constructor

// assignment operator

    void push(const string& s)
    {
        unsigned long hash = MyHash()(s) % max_size;
        list<string> & hashList = nodes[hash];

        hashList.push_back(s);

        if (++size > nodes.size())
        {
            max_size *= 4;
            rehash();
        }
    }

    bool search(const string& s)
    {
        unsigned long hash = MyHash()(s) % max_size;
        list<string>::iterator it = nodes[hash].begin();

        for (int i = 0; i < nodes[hash].size(); i++)
        {
            if (*it == s)
            {
                return true;
            }
            *it++;
        }

        return false;
    }

private:
    void rehash()
    {
        unsigned long hash;
        list<string>::iterator it;
        vector < list<string> > newNodes = nodes;
        newNodes.resize(max_size);

        for (int i = 0; i < nodes.size(); i++)
        {
            if (nodes[i].size() > 0)
            {
                it = nodes[i].begin();
                hash = MyHash()(*it) % max_size;
                newNodes[hash] = nodes[i];
            }
        }

        nodes = newNodes;
    }

    vector< list<string> > nodes;
    int max_size;
    int size;
};

#endif

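The hash function I'm using is djb2. My search and insert functions both seem to be very fast. It's the rehashing that takes a very long time.

If there is a better way to set up my hash table, please let me know. I'm not restricted in which data structures I use for this project.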
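
MyHash.h is not shown above. For reference, here is a minimal sketch of what a djb2-based functor could look like, assuming it wraps the usual djb2 routine (start at 5381, multiply by 33 and add each character). The body is an assumption for illustration, not the actual header:

```cpp
#include <string>

// Hypothetical stand-in for the MyHash functor from MyHash.h, which the
// question does not show. Classic djb2: start at 5381 and compute
// hash = hash * 33 + c for every character of the string.
struct MyHash
{
    unsigned long operator()(const std::string& s) const
    {
        unsigned long hash = 5381;
        for (unsigned char c : s)
        {
            hash = ((hash << 5) + hash) + c; // same as hash * 33 + c
        }
        return hash;
    }
};
```

The table reduces this value modulo max_size to pick a bucket, so the same string can land in a different bucket every time max_size changes.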

Answer 1

Stop copying all those strings just to watch them burn a moment later. Try this:

void rehash()
{
    std::vector<std::list<std::string>> newNodes(max_size);

    for (auto & bucket : nodes)
    {
        for (auto it = bucket.begin(); it != bucket.end(); )
        {
            std::list<std::string> & newBucket = newNodes[MyHash()(*it) % max_size];
            newBucket.splice(newBucket.end(), bucket, it++);
        }
    }

    nodes.swap(newNodes);
}   //    ^^^^^^^^^^^^^^

This also fixes your broken "rehash", which didn't really rehash anything.
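
To see why splice avoids the copies, here is a small sketch (the helper name moveFrontNode is made up for illustration). It relinks one list node into another list and returns the address of the string, which should be unchanged because the node itself is transferred rather than rebuilt:

```cpp
#include <cassert>
#include <list>
#include <string>

// Moves the first node of `from` to the back of `to` and returns the address
// of the transferred string. splice relinks the existing node, so the string
// is neither copied nor moved and keeps its address.
const std::string* moveFrontNode(std::list<std::string>& from,
                                 std::list<std::string>& to)
{
    assert(!from.empty());
    to.splice(to.end(), from, from.begin());
    return &to.back();
}
```

Taking the address of from.front() before the call and comparing it with the returned pointer shows that the two are equal, which is what makes the splice-based rehash cheap.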

Answer 2

    if (nodes[i].size() > 0)
    {
        it = nodes[i].begin();
        hash = MyHash()(*it) % max_size;
        newNodes[hash] = nodes[i];
    }

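I don't think this is right. The elements in nodes[i] should be distributed across different nodes in the larger table. So you need to recompute the hash for every element, not just the first one.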
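
A minimal sketch of what rehashing every element could look like, written as a free function so it stands alone. The names Buckets and rehashEveryElement are made up here, and std::hash stands in for MyHash, which isn't shown in the question:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

typedef std::vector<std::list<std::string> > Buckets;

// Builds a table with newCount buckets, hashing every string individually.
// std::hash is an assumption standing in for the question's MyHash.
Buckets rehashEveryElement(const Buckets& oldBuckets, std::size_t newCount)
{
    assert(newCount > 0);
    Buckets newBuckets(newCount);
    std::hash<std::string> hasher;

    for (const std::list<std::string>& bucket : oldBuckets)
    {
        for (const std::string& s : bucket)
        {
            // Each element picks its own bucket in the larger table.
            newBuckets[hasher(s) % newCount].push_back(s);
        }
    }
    return newBuckets;
}
```

This version still copies each string once, so it only illustrates the correctness fix, not a fast rehash.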

Answer 3

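You probably don't want to rehash everything just because size has reached the number of nodes. As a side note, you increment size every time you add a string to the table, so even if one bucket holds a list of 128 strings and every other bucket is still empty, you will resize the number of buckets. Are you sure that is the logic you intended? I'd suggest allocating around the square root of n buckets and not rehashing at all. If you are using a good hash function, the strings will be distributed fairly evenly across the buckets, and lookup time won't suffer too much.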
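
To put rough numbers on this, here is a sketch that simulates the growth rule from the question's push() (start at 128 buckets, quadruple whenever size exceeds the bucket count) and computes a fixed bucket count near the square root of n. Both function names are made up for illustration:

```cpp
#include <cmath>
#include <cstddef>

// Counts how many times the question's push() would call rehash() while
// inserting `words` strings: the table starts at 128 buckets and quadruples
// whenever the element count exceeds the bucket count.
int countRehashes(std::size_t words, std::size_t& finalBuckets)
{
    std::size_t buckets = 128;
    int rehashes = 0;
    for (std::size_t size = 1; size <= words; ++size)
    {
        if (size > buckets)
        {
            buckets *= 4;
            ++rehashes;
        }
    }
    finalBuckets = buckets;
    return rehashes;
}

// Fixed bucket count close to the square root of the expected element count.
std::size_t sqrtBucketCount(std::size_t expectedWords)
{
    return static_cast<std::size_t>(
        std::ceil(std::sqrt(static_cast<double>(expectedWords))));
}
```

For 350,000 words the question's rule rehashes six times and ends at 524,288 buckets, while the square-root sizing gives 592 buckets. With an even distribution, each of those 592 buckets would hold a chain of roughly 591 strings, so this trades rehash cost for longer chains on every lookup.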

Answer 4

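"If there is a better way to set up my hash table, please let me know. I'm not restricted in which data structures I use for this project."

In that case, use an existing hash map such as std::unordered_map or std::hash_map. I'm sure you'll fail the class, but you'll make the grade in real life.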
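
A minimal sketch of that approach. Since the table only stores words, it uses std::unordered_set, the set counterpart of std::unordered_map, which is my substitution. The function name buildDictionary is made up for illustration:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Loads words into a std::unordered_set. reserve() sizes the bucket array up
// front so that inserting words.size() elements does not trigger a rehash.
std::unordered_set<std::string> buildDictionary(const std::vector<std::string>& words)
{
    std::unordered_set<std::string> dict;
    dict.reserve(words.size());
    for (const std::string& w : words)
    {
        dict.insert(w); // duplicates are ignored
    }
    return dict;
}
```

Lookup is then dict.count(word) != 0, which replaces the hand-written search().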

Improving the insertion time of a hash table - C++

4 answers: