Fastest way to check whether a value already exists in an STL container

Time: 2017-02-27 11:53:56

Tags: c++ c++11

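I am holding a very large list of memory addresses (around 400,000) and need to check whether a given address already exists in it, 400,000 times.

A code example to illustrate my setup: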

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries

    // in my own code, these represent a new address list
    for (auto newAddress : newAddresses) {

        // already processed this address, skip it
        if (existingAddresses.find(newAddress) != existingAddresses.end()) {
            continue;
        }

        // we didn't have this address yet, so process it.
        SomeHeavyTask(newAddress);

        // so we don't process it again
        existingAddresses.emplace(newAddress);
    }

    Sleep(1000);
}

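This is the first implementation I came up with, and I think it can be improved a lot.

Next I came up with a custom indexing strategy, of the kind that is also used in databases. The idea is to take part of the value and use it as a key that selects the group (its own set) the value is stored in. If I take, for example, the last two hex digits of the address, I get 16^2 = 256 groups to put addresses in.

So I would end up with a map like this: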

[FF] -> all addresses ending with `FF`
[EF] -> all addresses ending with `EF`
[00] -> all addresses ending with `00`
// etc...

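With this, each lookup only has to search the ~360 entries in the corresponding set. That means lookups over ~360 entries, performed 400,000 times every second. Much better!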
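The bucketing idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name BucketedAddressIndex and its members are hypothetical, and the bucket key is taken to be the low byte of the address (its last two hex digits).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: an address index split into 256 buckets,
// keyed by the low byte (the last two hex digits) of the address.
class BucketedAddressIndex {
public:
    // Returns true if the address was not present yet and has now been recorded.
    bool insertIfNew(std::uintptr_t address) {
        return buckets_[bucketOf(address)].insert(address).second;
    }

    // Returns true if the address has been recorded before.
    bool contains(std::uintptr_t address) const {
        const std::set<std::uintptr_t>& bucket = buckets_[bucketOf(address)];
        return bucket.find(address) != bucket.end();
    }

private:
    // The low 8 bits select one of the 16^2 = 256 buckets.
    static std::size_t bucketOf(std::uintptr_t address) {
        return static_cast<std::size_t>(address & 0xFF);
    }

    std::array<std::set<std::uintptr_t>, 256> buckets_;
};

// Usage sketch: the second insertion of the same address is rejected.
inline void bucketedIndexExample() {
    BucketedAddressIndex index;
    const bool firstTime = index.insertIfNew(0x7FFE10FF);
    const bool secondTime = index.insertIfNew(0x7FFE10FF);
    assert(firstTime && !secondTime);
}
```

One caveat with this choice of key: if the addresses are aligned (say, to 16 bytes), their low bits are always zero, so only a fraction of the 256 buckets would ever be used. Keying on higher bits, or hashing the address, may spread the entries more evenly.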

Are there any other tricks or better ways to do this? My goal is to make this address lookup as fast as possible.

2 Answers:

Answer 0: (score: 11)

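std::set<uintptr_t> uses a balanced tree, so its lookup time is O(log N).

std::unordered_set<uintptr_t>, on the other hand, is hash-based, with an average lookup time of O(1).

Although this is only an asymptotic complexity measure, meaning an improvement is not guaranteed because of the constant factors involved, the difference is likely to be noticeable when the set contains 400,000 elements.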
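A minimal sketch of how the loop from the question might look with a hash-based set. The helper name collectNewAddresses is hypothetical, and the candidate addresses are assumed to arrive in a std::vector; the caller would then run SomeHeavyTask on each returned address.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the addresses from `candidates` that were not
// yet in `seen`, recording them in `seen` along the way.
std::vector<std::uintptr_t> collectNewAddresses(
        std::unordered_set<std::uintptr_t>& seen,
        const std::vector<std::uintptr_t>& candidates) {
    std::vector<std::uintptr_t> fresh;
    for (std::uintptr_t address : candidates) {
        // insert() reports whether the element was actually added, so a single
        // hash lookup covers both the existence check and the insertion.
        if (seen.insert(address).second) {
            fresh.push_back(address);
        }
    }
    return fresh;
}

// Usage sketch: only 4 and 5 are new; the repeated 4 is reported once.
inline void collectNewAddressesExample() {
    std::unordered_set<std::uintptr_t> seen{1, 2, 3};
    seen.reserve(400000);  // optional: pre-size the table to limit rehashing
    const std::vector<std::uintptr_t> fresh =
        collectNewAddresses(seen, {2, 3, 4, 5, 4});
    assert((fresh == std::vector<std::uintptr_t>{4, 5}));
}
```

Relying on the bool returned by insert() avoids the separate find() followed by emplace(), which would otherwise hash each new address twice.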

Answer 1: (score: 1)

You can use a merge-like algorithm:

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
    auto existing_it = existingAddresses.begin();
    auto new_it = newAddresses.begin();

    while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) {
        if (*new_it < *existing_it) {
            // we didn't have this address yet, so process it.
            SomeHeavyTask(*new_it);
            // so we don't process it again
            existingAddresses.insert(existing_it, *new_it);
            ++new_it;
        } else if (*existing_it < *new_it) {
            ++existing_it;
        } else { // Both equal
            ++existing_it;
            ++new_it;
        }
    }
    while (new_it != newAddresses.end()) {
        // we didn't have this address yet, so process it.
        SomeHeavyTask(*new_it);
        // so we don't process it again
        existingAddresses.insert(existingAddresses.end(), *new_it);
        ++new_it;
    }
    Sleep(1000);
}

The complexity is now linear: O(N + M) instead of O(N log M), where N is the number of new addresses and M is the number of old ones.
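The same linear pass can also be expressed with standard algorithms. The sketch below is a variation, not the code from this answer: it assumes both address lists are kept as sorted std::vectors without duplicates instead of std::sets, and the helper names findUnseen and absorb are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper: both inputs must be sorted in ascending order.
// Returns the addresses present in `incoming` but not in `existing`.
std::vector<std::uintptr_t> findUnseen(
        const std::vector<std::uintptr_t>& existing,
        const std::vector<std::uintptr_t>& incoming) {
    assert(std::is_sorted(existing.begin(), existing.end()));
    assert(std::is_sorted(incoming.begin(), incoming.end()));

    std::vector<std::uintptr_t> unseen;
    // One linear pass over both ranges, like the merge step of merge sort.
    std::set_difference(incoming.begin(), incoming.end(),
                        existing.begin(), existing.end(),
                        std::back_inserter(unseen));
    return unseen;
}

// Hypothetical helper: folds the sorted `unseen` addresses into `existing`
// while keeping `existing` sorted.
void absorb(std::vector<std::uintptr_t>& existing,
            const std::vector<std::uintptr_t>& unseen) {
    const auto middle = static_cast<std::ptrdiff_t>(existing.size());
    existing.insert(existing.end(), unseen.begin(), unseen.end());
    // Merges the two sorted halves [begin, middle) and [middle, end).
    std::inplace_merge(existing.begin(), existing.begin() + middle,
                       existing.end());
}
```

Each cycle would then call findUnseen, run SomeHeavyTask on every returned address, and finish with absorb. Contiguous vectors also tend to be friendlier to the cache than node-based sets, although whether that matters here would need to be measured.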