I hold a very large list of memory addresses (roughly 400,000) and need to check, 400,000 times, whether a given address already exists in it.
A code sample to illustrate my setup:
std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries
while (true) {
// a new list with possible new addresses
std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
// in my own code, these represent a new address list
for (auto newAddress : newAddresses) {
// already processed this address, skip it
if (existingAddresses.find(newAddress) != existingAddresses.end()) {
continue;
}
// we didn't have this address yet, so process it.
SomeHeavyTask(newAddress);
// so we don't process it again
existingAddresses.emplace(newAddress);
}
Sleep(1000);
}
This is the first implementation I came up with, and I think it can be improved considerably.
Next I came up with a custom indexing strategy, similar to what databases use. The idea is to take part of the value and use it to index the address into its own group/set. If, for example, I take the last two hex digits of the address, I get 16^2 = 256 sets to distribute the addresses over.
So I end up with a map like this:
[FF] -> all addresses ending with `FF`
[EF] -> all addresses ending with `EF`
[00] -> all addresses ending with `00`
// etc...
With this I only have to do a lookup on the ~360 entries in the corresponding set, so a ~360-entry lookup is performed 400,000 times every second. Much better!
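A minimal sketch of that bucketing idea, assuming "the last two hex digits" means the lowest byte of the address (the names `buckets`, `bucketOf`, `alreadySeen` and `markSeen` are mine, not from the original code):

#include <array>
#include <cstdint>
#include <set>

// 256 buckets, selected by the lowest byte (last two hex digits) of the address
std::array<std::set<std::uintptr_t>, 256> buckets;

std::set<std::uintptr_t>& bucketOf(std::uintptr_t address) {
    return buckets[address & 0xFF];
}

bool alreadySeen(std::uintptr_t address) {
    const auto& bucket = bucketOf(address);
    return bucket.find(address) != bucket.end();
}

void markSeen(std::uintptr_t address) {
    bucketOf(address).insert(address);
}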
I am wondering whether there are other tricks or better ways to do this? My goal is to make this address lookup as fast as possible.
Answer 0 (score: 11)
std::set<uintptr_t> uses a balanced tree, so lookup time is O(log N).
std::unordered_set<uintptr_t> is hash based, so lookup time is O(1) on average.
While that is only a log N factor, meaning the constant factors involved give no guaranteed improvement, the difference can be noticeable when the set contains 400,000 elements.
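As a rough illustration of that suggestion (a sketch based on the code from the question, not the answerer's own code), only the container type needs to change; calling reserve up front can also reduce rehashing:

#include <cstdint>
#include <unordered_set>

void SomeHeavyTask(std::uintptr_t); // assumed to be defined elsewhere

std::unordered_set<std::uintptr_t> existingAddresses; // average O(1) lookup

void processBatch(const std::unordered_set<std::uintptr_t>& newAddresses) {
    // make room for the worst case up front to avoid repeated rehashing
    existingAddresses.reserve(existingAddresses.size() + newAddresses.size());
    for (auto newAddress : newAddresses) {
        // already processed this address, skip it
        if (existingAddresses.find(newAddress) != existingAddresses.end())
            continue;
        // process it once, then remember it
        SomeHeavyTask(newAddress);
        existingAddresses.insert(newAddress);
    }
}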
Answer 1 (score: 1)
You can use a merge-like algorithm:
std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries
while (true) {
// a new list with possible new addresses
std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
auto existing_it = existingAddresses.begin();
auto new_it = newAddresses.begin();
while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) {
if (*new_it < *existing_it) {
// we didn't have this address yet, so process it.
SomeHeavyTask(*new_it);
// so we don't process it again
existingAddresses.insert(existing_it, *new_it);
++new_it;
} else if (*existing_it < *new_it) {
++existing_it;
} else { // Both equal
++existing_it;
++new_it;
}
}
while (new_it != newAddresses.end()) {
// we didn't have this address yet, so process it.
SomeHeavyTask(*new_it);
// so we don't process it again
existingAddresses.insert(existingAddresses.end(), *new_it);
++new_it;
}
Sleep(1000);
}
The complexity is now linear, O(N + M), instead of O(N log M) (with N the number of new addresses and M the number of old addresses).
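A related sketch (my own addition, not part of this answer): the same single pass over both sorted sets can be expressed with std::set_difference, which collects only the addresses that are not yet known:

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

void SomeHeavyTask(std::uintptr_t); // assumed to be defined elsewhere

void processBatch(std::set<std::uintptr_t>& existingAddresses,
                  const std::set<std::uintptr_t>& newAddresses) {
    // addresses present in newAddresses but not in existingAddresses, in sorted order
    std::vector<std::uintptr_t> toProcess;
    std::set_difference(newAddresses.begin(), newAddresses.end(),
                        existingAddresses.begin(), existingAddresses.end(),
                        std::back_inserter(toProcess));
    for (auto address : toProcess) {
        SomeHeavyTask(address);
        existingAddresses.insert(address);
    }
}

This keeps the O(N + M) walk over both sets but leaves the bookkeeping of the two iterators to the standard library.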