Question

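I have a very large list of memory addresses (about 400,000 entries) and need to check, 400,000 times, whether a given address is already in it.

A code example to illustrate my setup: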

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries

    // in my own code, these represent a new address list
    for (auto newAddress : newAddresses) {

        // already processed this address, skip it
        if (existingAddresses.find(newAddress) != existingAddresses.end()) {
          continue;
        }

        // we didn't have this address yet, so process it.
        SomeHeavyTask(newAddress);

        // so we don't process it again
        existingAddresses.emplace(newAddress);
    }

    Sleep(1000);
}

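This is the first implementation I came up with, and I think it can be improved a lot.

Next I came up with a custom indexing strategy of the kind that is also used in databases. The idea is to take part of the value and use it to index the value into its own group set. If I take, for example, the last two hex digits of the address, I get 16^2 = 256 groups to put addresses into.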

So I would end up with a map like this:

[FF] -> all addresses ending with `FF`
[EF] -> all addresses ending with `EF`
[00] -> all addresses ending with `00`
// etc...

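With this, I only need to do a lookup among ~360 entries in the corresponding set, so a ~360-entry lookup gets performed 400,000 times per second. Much better!

I would like to know whether there are other tricks or better ways to do this. My goal is to make this address lookup as fast as possible.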
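To make the grouping idea concrete, here is a minimal sketch of it. The class name `BucketedAddressSet` and its member names are made up for illustration and are not part of the original code; the bucket is chosen from the low byte of the address, i.e. its last two hex digits.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: 256 ordered sets, one per value of the address's low byte.
class BucketedAddressSet {
public:
    // Returns true if the address was newly added, false if it was already stored.
    bool insert(std::uintptr_t address) {
        return buckets_[bucketOf(address)].insert(address).second;
    }

    // Looks only inside the one bucket that could hold this address.
    bool contains(std::uintptr_t address) const {
        const std::set<std::uintptr_t>& bucket = buckets_[bucketOf(address)];
        return bucket.find(address) != bucket.end();
    }

private:
    // The last two hex digits of the address select the bucket (0x00..0xFF).
    static std::size_t bucketOf(std::uintptr_t address) {
        return static_cast<std::size_t>(address & 0xFF);
    }

    std::array<std::set<std::uintptr_t>, 256> buckets_;
};
```

One caveat with this scheme: if the addresses are aligned (say, to 8 or 16 bytes), their low bits are always zero, so only a fraction of the 256 buckets would actually be used and the remaining ones would hold correspondingly more entries.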

Answer 1

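std::set<uintptr_t> uses a balanced tree, so lookup time is O(log N).

std::unordered_set<uintptr_t>, on the other hand, is hash-based, with O(1) lookup time on average.

Although this is only a big-O measure, meaning an improvement is not guaranteed because of the constant factors involved, the difference may well be noticeable when the set contains 400,000 elements.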
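As a rough sketch of what the hash-based version of the loop could look like: the function name `collectUnseen` and the use of a `std::vector` for the incoming addresses are assumptions made for this example, not something from the question. It relies on the fact that `insert` on a `std::unordered_set` returns a pair whose `second` member says whether the element was actually added, so the lookup and the insertion happen in a single step.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: returns the candidates that were not in `seen` yet,
// and records them in `seen` so they are skipped on the next pass.
std::vector<std::uintptr_t> collectUnseen(
    std::unordered_set<std::uintptr_t>& seen,
    const std::vector<std::uintptr_t>& candidates) {
    std::vector<std::uintptr_t> fresh;
    for (std::uintptr_t address : candidates) {
        // insert() returns {iterator, bool}; the bool is true only when
        // the address was not already present.
        if (seen.insert(address).second) {
            fresh.push_back(address);
        }
    }
    return fresh;
}
```

The caller would then run `SomeHeavyTask` on each returned address. Calling `seen.reserve(...)` up front with the expected number of entries may help avoid rehashing while the set grows, though whether that matters in practice is something to measure.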

Answer 2

You can use an algorithm similar to a merge:

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
    auto existing_it = existingAddresses.begin();
    auto new_it = newAddresses.begin();

    while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) {
        if (*new_it < *existing_it) {
            // we didn't have this address yet, so process it.
            SomeHeavyTask(*new_it);
            // so we don't process it again
            existingAddresses.insert(existing_it, *new_it);
            ++new_it;
        } else if (*existing_it < *new_it) {
            ++existing_it;
        } else { // Both equal
            ++existing_it;
            ++new_it;
        }
    }
    // everything left in newAddresses is greater than all existing addresses
    while (new_it != newAddresses.end()) {
        // we didn't have this address yet, so process it.
        SomeHeavyTask(*new_it);
        // so we don't process it again
        existingAddresses.insert(existingAddresses.end(), *new_it);
        ++new_it;
    }
    Sleep(1000);
}

The complexity is now linear: O(N + M) instead of O(N log M), where N is the number of new addresses and M is the number of old ones.
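The same linear merge-style pass can also be written with the standard `std::set_difference` algorithm instead of a hand-written loop. This is a swapped-in variant rather than the code above, and the function name `newOnly` is invented for the example. It works because both sets iterate in sorted order.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper: returns the addresses that are in `incoming`
// but not in `existing`, in ascending order.
std::vector<std::uintptr_t> newOnly(
    const std::set<std::uintptr_t>& existing,
    const std::set<std::uintptr_t>& incoming) {
    std::vector<std::uintptr_t> result;
    // Both ranges are sorted, so one pass over each is enough.
    std::set_difference(incoming.begin(), incoming.end(),
                        existing.begin(), existing.end(),
                        std::back_inserter(result));
    return result;
}
```

The caller would run `SomeHeavyTask` on each element of the result and then add them to the existing set, for example with `existingAddresses.insert(result.begin(), result.end())`. Note that this range insert is not guaranteed to be linear unless the new elements happen to land at the end, so the hinted insertion in the hand-written loop may still be the better choice for staying at O(N + M).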

The fastest way to check whether a value already exists in an STL container
