Question

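The code below prints the highest frequency it can find in my hash table (which is a bunch of linked lists) 10 times. I need it to print the top 10 frequencies in the hash table instead. I do not know how to do this (code examples would be great; plain-English logic or pseudocode is just as great).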

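I create a temporary list pointer called 'tmp', which points into my hash table 'hashtable'.
A while loop then walks the list and looks for the highest frequency, which is an int, 'tmp->freq'.
The loop keeps copying the highest frequency it finds into the variable 'topfreq' until it reaches the end of the linked lists in the hash table.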

My 'node' is a struct made up of the variables 'freq' (int) and 'word' (128 chars). When the loop has nothing left to search, it prints these two values on screen.

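The problem is that I can't work out how to find the next lowest number after the one I have just found (and this can include another node with the same freq value, so I have to check that the word is not the same too).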

void toptenwords()
{
    int topfreq = 0;
    int minfreq = 0;
    char topword[SIZEOFWORD];

    for(int p = 0; p < 10; p++) // We need the top 10 frequencies... so we do this 10 times
    {
        for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hash table
        {
            node* tmp;
            tmp = hashtable[m];

            while(tmp != NULL) // Walk through the entire linked list
            {
                if(tmp->freq > topfreq) // If the frequency on hand is larger than the one found, store...
                {
                    topfreq = tmp->freq;
                    strcpy(topword, tmp->word);
                }
                tmp = tmp->next;
            }
        }
        cout << topfreq << "\t" << topword << endl;
    }
}

Any and all help is greatly appreciated :)

Answer 1

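Keep an array of node pointers for the 10 best nodes, insert each node into the array, and maintain the array in sorted order. The eleventh slot in the array is scratch space: it is overwritten on each iteration and contains junk.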

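void toptenwords()
{
        node *topwords[11]; // The 10 best nodes so far, plus one scratch slot for the incoming node
        int current_topwords = 0;

        for(int m = 0; m < HASHTABLESIZE; m++) // Go through the entire hash table
        {
                node* tmp;
                tmp = hashtable[m];

                while(tmp != NULL) // Walk through the entire linked list
                {
                        topwords[current_topwords] = tmp; // Place the node in the first free (or scratch) slot
                        current_topwords++;
                        for(int i = current_topwords - 1; i > 0; i--) // Bubble it up to its sorted position
                        {
                                if(topwords[i]->freq > topwords[i - 1]->freq)
                                {
                                        node *temp = topwords[i - 1];
                                        topwords[i - 1] = topwords[i];
                                        topwords[i] = temp;
                                }
                                else break;
                        }
                        if(current_topwords > 10) current_topwords = 10; // Discard whatever ended up in the eleventh slot
                        tmp = tmp->next;
                }
        }

        for(int i = 0; i < current_topwords; i++) // Print the result, highest frequency first
                cout << topwords[i]->freq << "\t" << topwords[i]->word << endl;
}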

Answer 2

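I would maintain a set of words already used, and change the innermost if condition to test that the frequency is greater than the previous top frequency AND that tmp->word is not in the set of words already used.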
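
A minimal sketch of that idea, assuming a node layout like the one described in the question. The function name top_words_with_used_set, the table passed as a parameter, and the returned vector are illustrative choices, not part of the original code. The top frequency is reset on every pass, so each pass finds the best word that has not been reported yet.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

// Returns up to `count` (frequency, word) pairs, highest frequency first.
// Hypothetical helper: the question's code uses globals and prints instead.
std::vector<std::pair<int, std::string>>
top_words_with_used_set(node* const* table, int table_size, int count)
{
    assert(table != nullptr && table_size >= 0);

    std::vector<std::pair<int, std::string>> result;
    std::set<std::string> used; // words that were already reported

    for (int p = 0; p < count; p++) {
        const node* best = nullptr; // best unused node found in this pass
        for (int m = 0; m < table_size; m++) {
            for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next) {
                bool higher = (best == nullptr) || tmp->freq > best->freq;
                // The extra test: skip words reported in an earlier pass.
                if (higher && used.count(tmp->word) == 0)
                    best = tmp;
            }
        }
        if (best == nullptr)
            break; // fewer than `count` distinct words in the table
        used.insert(best->word);
        result.emplace_back(best->freq, best->word);
    }
    return result;
}
```

Calling top_words_with_used_set(hashtable, HASHTABLESIZE, 10) and printing each pair would reproduce the output format of the question. This still scans the whole table once per result, so it costs roughly ten full passes.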

Answer 3

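When iterating over the hash table (and then over each linked list contained therein), keep a self-balancing binary tree (std::set) as a "result" list. As you come across each frequency, insert it into the list, then truncate the list if it contains more than 10 entries. When you finish, you'll have a set (a sorted list) of the top ten frequencies, which you can manipulate as you wish.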

There may be gains to be had by using sets instead of linked lists in the hash table itself, but you can work that out for yourself.
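
Here is one way this could look, as a sketch rather than a definitive version. It assumes the node layout from the question, and it stores (frequency, word) pairs in the set so that two different words with the same frequency can both be kept. The function name top_words_set and its parameters are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

// Returns the `limit` largest (frequency, word) pairs, ordered ascending.
// Hypothetical helper name and signature.
std::set<std::pair<int, std::string>>
top_words_set(node* const* table, int table_size, std::size_t limit)
{
    assert(table != nullptr && table_size >= 0);

    std::set<std::pair<int, std::string>> best;
    for (int m = 0; m < table_size; m++) {
        for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next) {
            best.emplace(tmp->freq, tmp->word);
            // Truncate: the smallest pair sits at begin() in a std::set.
            if (best.size() > limit)
                best.erase(best.begin());
        }
    }
    return best;
}
```

Iterating the result from rbegin() to rend() visits the words from highest to lowest frequency.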

Answer 4

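Step 1 (inefficient):

Move the entries into a sorted container via insertion sort, but insert into a container (e.g. a linked list or a vector) capped at size 10, and drop any elements that fall off the bottom of the list.

Step 2 (efficient):

Same as step 1, but keep track of the size of the item at the bottom of the list, and skip the insertion step entirely if the current item is too small.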
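
The two steps can be sketched together as follows. This assumes the node layout from the question; the function name top_nodes_insertion and the use of a std::vector of node pointers are choices made for the example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

// Returns up to `limit` nodes, highest frequency first.
// Hypothetical helper name and signature.
std::vector<const node*>
top_nodes_insertion(node* const* table, int table_size, std::size_t limit)
{
    assert(table != nullptr && table_size >= 0);

    std::vector<const node*> best; // kept sorted, highest frequency first
    if (limit == 0)
        return best;

    for (int m = 0; m < table_size; m++) {
        for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next) {
            // Step 2: if the list is full and this node cannot beat the
            // bottom entry, skip the insertion entirely.
            if (best.size() == limit && tmp->freq <= best.back()->freq)
                continue;

            // Step 1: find the sorted position, scanning from the bottom.
            std::size_t pos = best.size();
            while (pos > 0 && best[pos - 1]->freq < tmp->freq)
                pos--;
            best.insert(best.begin() + static_cast<std::ptrdiff_t>(pos), tmp);

            // Drop the element that fell off the bottom of the list.
            if (best.size() > limit)
                best.pop_back();
        }
    }
    return best;
}
```

With limit fixed at 10, each insertion shifts at most 10 pointers, and the early test in step 2 means most nodes in a large table are rejected after a single comparison.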

Answer 5

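A hash table containing linked lists of words seems like a peculiar data structure to use if the goal is to accumulate word frequencies.

Nonetheless, an efficient way to get the ten highest-frequency nodes is to insert each one into a priority queue/heap, such as a Fibonacci heap, which has O(1) amortized insertion time and O(log n) amortized delete-max time. Assuming that iteration over the hash table is fast, this method has a runtime of O(n × O(1) + 10 × O(log n)) ≡ O(n).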
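
The C++ standard library has no Fibonacci heap, so the sketch below swaps in a plain binary heap: std::make_heap builds it in O(n), and each std::pop_heap costs O(log n), which gives the same O(n) overall bound for a fixed 10 results. The node layout is assumed from the question, and the function name top_nodes_max_heap is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

// Returns up to `limit` nodes, highest frequency first.
// Hypothetical helper name and signature.
std::vector<const node*>
top_nodes_max_heap(node* const* table, int table_size, std::size_t limit)
{
    assert(table != nullptr && table_size >= 0);

    // Collect every node of every chain.
    std::vector<const node*> heap;
    for (int m = 0; m < table_size; m++)
        for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next)
            heap.push_back(tmp);

    // "Less than" on frequency makes the standard heap a max-heap.
    auto lower_freq = [](const node* a, const node* b) {
        return a->freq < b->freq;
    };
    std::make_heap(heap.begin(), heap.end(), lower_freq);

    std::vector<const node*> result;
    while (!heap.empty() && result.size() < limit) {
        // pop_heap moves the current maximum to the back of the vector.
        std::pop_heap(heap.begin(), heap.end(), lower_freq);
        result.push_back(heap.back());
        heap.pop_back();
    }
    return result;
}
```

The trade-off is memory: this version holds a pointer to every node at once, whereas a bounded structure only ever holds ten.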

Answer 6

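Suppose there are n words in total, and we need the k most frequent words (here, k = 10).

If n is much larger than k, the most efficient way I know of is to maintain a min-heap (i.e. the top element has the minimum frequency of all elements in the heap). On each iteration, you insert the next frequency into the heap, and if the heap now contains k + 1 elements, you remove the smallest one. This way, the heap stays at a size of k elements throughout, containing at any time the k highest-frequency elements seen so far. At the end of processing, read out the k highest-frequency elements in increasing order.

Time complexity: for each of the n words, we do two things: insert into a heap of size at most k, and remove the minimum element. Each operation costs O(log k) time, so the entire loop takes O(n log k) time. Finally, we read out the k elements from a heap of size at most k, taking O(k log k) time, for a total time of O((n + k) log k). Since we know that k < n, O(k log k) is at worst O(n log k), so this simplifies to just O(n log k).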
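
A short sketch of this bounded min-heap, using std::priority_queue with std::greater so that the smallest entry is on top. The node layout is assumed from the question; the entry alias, the function name top_k_min_heap, and its parameters are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

using entry = std::pair<int, std::string>; // (frequency, word)

// Returns the k highest-frequency entries in increasing order.
// Hypothetical helper name and signature.
std::vector<entry>
top_k_min_heap(node* const* table, int table_size, std::size_t k)
{
    assert(table != nullptr && table_size >= 0);

    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap;

    for (int m = 0; m < table_size; m++) {
        for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next) {
            heap.emplace(tmp->freq, tmp->word); // O(log k)
            if (heap.size() > k)
                heap.pop();                     // remove the smallest, O(log k)
        }
    }

    // Read out what is left, smallest first.
    std::vector<entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

To print the words from most to least frequent, walk the returned vector backwards.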

Answer 7

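The absolute fastest way to do this is to use a SoftHeap. Using a SoftHeap, you can find the top 10 items in O(n) time, whereas every other solution posted here would take O(n lg n) time.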

http://en.wikipedia.org/wiki/Soft_heap

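This Wikipedia article shows how to find the median in O(n) time using a soft heap, and the top 10 is simply a subset of the median problem. You could then sort the items that were in the top 10 if you need them in order, and since you are always sorting at most 10 items, it is still O(n) time.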
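
A soft heap is too long to show here, so the sketch below is not a soft heap. It swaps in std::nth_element, a different selection routine from the standard library that is linear on average, to show the same "select the top 10, then sort only those 10" shape. The node layout is assumed from the question, and the function name top_nodes_selection is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout, following the question's description of 'node'.
struct node {
    int freq;
    char word[128];
    node* next;
};

// Returns up to `limit` nodes, highest frequency first.
// Hypothetical helper name and signature.
std::vector<const node*>
top_nodes_selection(node* const* table, int table_size, std::size_t limit)
{
    assert(table != nullptr && table_size >= 0);

    // Collect every node of every chain.
    std::vector<const node*> all;
    for (int m = 0; m < table_size; m++)
        for (const node* tmp = table[m]; tmp != nullptr; tmp = tmp->next)
            all.push_back(tmp);

    auto higher_freq = [](const node* a, const node* b) {
        return a->freq > b->freq;
    };

    if (all.size() > limit) {
        // Partition so the `limit` highest-frequency nodes come first,
        // in no particular order.
        std::nth_element(all.begin(),
                         all.begin() + static_cast<std::ptrdiff_t>(limit),
                         all.end(), higher_freq);
        all.resize(limit);
    }

    // Sorting at most `limit` items is cheap when limit is 10.
    std::sort(all.begin(), all.end(), higher_freq);
    return all;
}
```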

Answer 8

I finally figured it out...

void toptenwords()
{
    int topfreq = 0;
    int counter = 0;

    cout << "\nTop Ten Words" << endl;

    for(int m = 0; m < HASHTABLESIZE; m++) // We need to find the highest frequency first...
    {
        node* tmp;
        tmp = hashtable[m];

        while(tmp != NULL) // Walk through the entire linked list
        {
            if(tmp->freq > topfreq) // If the frequency on hand is larger than the one found, store it
                topfreq = tmp->freq;
            tmp = tmp->next;
        }
    }

    while(topfreq > 0 && counter < 10) // This will now find the top 10 frequencies
    {
        for(int m = 0; m < HASHTABLESIZE && counter < 10; m++) // Go through the entire hash table
        {
            node* tmp;
            tmp = hashtable[m];

            while(tmp != NULL && counter < 10) // Walk the list, stopping once 10 words are printed
            {
                if(tmp->freq == topfreq) // If the frequency we're on is equal to the frequency we're looking for...
                {
                    counter++;
                    cout << tmp->freq << "\t" << tmp->word << endl;
                }
                tmp = tmp->next;
            }
        }

        topfreq--; // Try the next lower frequency; the loop ends once this reaches 0
    }
}

It takes about 30-60 seconds to execute, but it does the job. I'm not great on efficiency (as is evident from the time it takes).

Top 10 Frequencies in a Hash Table with Linked Lists
