How can I improve the performance of std::set_intersection in C++?

Asked: 2019-02-19 09:44:04

Tags: c++ performance set hashtable

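While experimenting with std::set in C++ and set() in Python, I ran into a performance difference that I cannot explain. Set intersection in C++ is at least 3 times slower than in Python.

Can anybody point out optimizations I could make to the C++ code, and/or explain how Python manages to do this so much faster?

I expected both to use a similar O(n) algorithm, given that the set is ordered. But perhaps Python applies some optimizations that make its constant factor smaller.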

set_bench.cc

#include <iostream>
#include <set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <string>   // std::string, used by elapsed()
#include <cstdint>  // int64_t

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::set<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const std::set<T>& s1, const std::set<T>& s2, std::set<T>& result)
{
    std::set_intersection(s1.begin(), s1.end(),
                            s2.begin(), s2.end(),
                            std::inserter(result, result.begin()));
}

int main()
{
    std::set<int64_t> s1;
    std::set<int64_t> s2;
    std::set<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

set_bench.py

#!/usr/bin/env python3

import time

def elapsed(f, s):
    start = time.monotonic()
    f()
    elapsed = time.monotonic() - start
    print(f'{s} {elapsed} seconds')

def fill_set(s, start, end, step=1):
    for i in range(start, end, step):
        s.add(i)

def intersect(s1, s2, result):
    result.update(s1 & s2)

s1 = set()
s2 = set()

elapsed(lambda : fill_set(s1, 8, 1000*1000*100, 13), 'fill s1 took')
elapsed(lambda : fill_set(s2, 0, 1000*1000*100, 7), 'fill s2 took')

print(f's1 length = {len(s1)}, s2 length = {len(s2)}')


s3 = set()

elapsed(lambda: intersect(s1, s2, s3), 'intersect s1 and s2 took')

print(f's3 length = {len(s3)}')

# sleep to let check memory consumption
# while True: time.sleep(1)

Here are the results of running these programs in the following environment:

  • clang version 7.0.1
  • gcc 8.2.0
  • Python 3.7.2
  • i7-7700 CPU @ 3.60GHz
$ clang -lstdc++ -O0 set_bench.cc -o set_bench && ./set_bench
fill s1 took 5.38646 seconds
fill s2 took 10.5762 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.48387 seconds
s3 length = 1098901
$ clang -lstdc++ -O1 set_bench.cc -o set_bench && ./set_bench
fill s1 took 3.31435 seconds
fill s2 took 6.41415 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.01276 seconds
s3 length = 1098901
$ clang -lstdc++ -O2 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.90269 seconds
fill s2 took 3.85651 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.512727 seconds
s3 length = 1098901
$ clang -lstdc++ -O3 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.92473 seconds
fill s2 took 3.72621 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.523683 seconds
s3 length = 1098901
$ gcc -lstdc++ -O3 set_bench.cc -o set_bench && time ./set_bench
fill s1 took 1.72481 seconds
fill s2 took 3.3846 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.516702 seconds
s3 length = 1098901
$ python3.7 ./set_bench.py 
fill s1 took 0.9404696229612455 seconds
fill s2 took 1.082577683031559 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.17995300807524472 seconds
s3 length = 1098901

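As you can see, the results are identical, so I assume both programs perform the same computation.

By the way, the RSS of the C++ program is 1084896 kB, and that of the Python program is 1590400 kB.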

3 answers:

Answer 0 (score: 4)

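There are two questions in this post:

Q: How do I improve the performance of std::set_intersection in C++?

Use sorted std::vectors instead of sets; they are much more cache-friendly. Since the intersection is done sequentially in a single pass, it will be as fast as it can get. On my system I got a run time of 0.04 seconds. Stop here if this is all you need.

Q: ... How does Python do this so much faster?

Or in other words: "Why is Python's set faster than the C++ set?" For the rest of this post I will focus on that question.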

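First of all, Python's set is a hash table, while std::set is a binary tree. So use std::unordered_set to compare apples to apples (we rule out the binary tree at this point because of its O(log N) lookup complexity).

Note also that std::set_intersection is simply a two-pointer algorithm: it iterates over two sorted sets and keeps only the matching values. Apart from its name, it has nothing in common with Python's set intersection, which is itself just a simple loop:

  • Iterate over the smaller hash table
  • For each element, if it exists in the other hash table, add it to the result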
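For contrast with the hash-based loop, here is a minimal sketch of the two-pointer merge that std::set_intersection performs on sorted ranges. The function name intersect_sorted is made up for illustration, and the sketch assumes sorted input without duplicates; it is not the standard library's actual source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative re-implementation of the two-pointer idea behind
// std::set_intersection. `intersect_sorted` is a hypothetical name.
std::vector<std::int64_t> intersect_sorted(const std::vector<std::int64_t>& a,
                                           const std::vector<std::int64_t>& b)
{
    // Precondition: both inputs must already be sorted.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<std::int64_t> out;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            ++i;                // *i is smaller than everything left in b
        } else if (*j < *i) {
            ++j;                // *j is smaller than everything left in a
        } else {
            out.push_back(*i);  // match: keep it and advance both sides
            ++i;
            ++j;
        }
    }
    return out;
}
```

Each element is visited at most once, so the cost is linear in the combined input size, but only if both inputs are sorted.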

So we cannot use std::set_intersection on unsorted data and need to implement the loop ourselves:

    for (auto& v : set1) {
        if (set2.find(v) != set2.end()) {
            result.insert(v);
        }
    }

Nothing fancy here. Unfortunately, a straightforward application of this algorithm to std::unordered_set is still 3 times slower. How can that be?

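  1. We observe that the input data set is larger than 100 MB. That does not fit into the 8 MB cache of the i7-7700, which means that the more work you can keep within an 8 MB working set, the faster the program runs.

  2. Python uses a special form of "dense hash table", similar to the PHP hash table (broadly, the class of open addressing hash tables), whereas the C++ std::unordered_set is typically a naive, vector-of-lists hash table. The dense structure is much more cache-friendly and therefore faster. For implementation details, see dictobject.c and setobject.c.

  3. The built-in C++ std::hash<long> is too complex for the already-unique input data set you are generating. Python, on the other hand, uses an identity (no-op) hash function for integers up to 2^30 (see long_hash). Collisions are amortized by an LCG built into its hash table implementation. You cannot match that with C++ standard library features; an identity hash here would, unfortunately, again result in a hash table that is too sparse.

  4. Python uses a custom memory allocator, pymalloc, which is similar to jemalloc and optimized for data locality. It generally outperforms the default Linux (glibc) malloc that a C++ program would typically use.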
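To make the "dense" layout from point 2 concrete, here is a minimal sketch of an open addressing set that stores keys directly in one contiguous array and uses an identity hash. DenseSet and kEmpty are hypothetical names, the table never grows, and it uses plain linear probing; CPython's real probing sequence is more elaborate, so treat this only as an illustration of the memory layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking an unused slot; the toy set therefore cannot store -1.
constexpr std::int64_t kEmpty = -1;

// Toy open addressing set (hypothetical, for illustration only).
class DenseSet {
public:
    explicit DenseSet(std::size_t expected)
    {
        std::size_t cap = 8;
        while (cap < expected * 2) cap *= 2;  // power of two, load factor <= 0.5
        slots_.assign(cap, kEmpty);
    }

    // Returns true if the key was newly inserted.
    bool insert(std::int64_t key)
    {
        assert(key != kEmpty);
        assert(size_ * 2 < slots_.size());    // this sketch does not rehash
        std::size_t i = index(key);
        while (slots_[i] != kEmpty) {
            if (slots_[i] == key) return false;
            i = (i + 1) & (slots_.size() - 1);  // linear probing: next slot
        }
        slots_[i] = key;
        ++size_;
        return true;
    }

    bool contains(std::int64_t key) const
    {
        std::size_t i = index(key);
        while (slots_[i] != kEmpty) {
            if (slots_[i] == key) return true;
            i = (i + 1) & (slots_.size() - 1);
        }
        return false;
    }

    std::size_t size() const { return size_; }

private:
    // Identity hash followed by a mask; no per-element heap node is allocated.
    std::size_t index(std::int64_t key) const
    {
        return static_cast<std::size_t>(key) & (slots_.size() - 1);
    }

    std::vector<std::int64_t> slots_;
    std::size_t size_ = 0;
};
```

A lookup touches one array and, on a collision, the neighbouring slots, whereas a node-based table typically follows a pointer to a separately allocated node for every element.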

Armed with this knowledge, we can contrive a C++ version with similar performance, to demonstrate technical feasibility:

#include <iostream>
#include <unordered_set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <tuple>
#include <string>
#include <cstdint>  // int64_t

using namespace std::chrono_literals;

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    std::cout << s << " " << (end - start) / 1.0s << " seconds" << std::endl;
}

template <typename T>
struct myhash {
    size_t operator()(T x) const {
        return x / 5; // cheating to improve data locality
    }
};

template <typename T>
using myset = std::unordered_set<T, myhash<T>>;

template <typename T>
void fill_set(myset<T>& s, T start, T end, T step)
{
    s.reserve((end - start) / step + 1);
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const myset<T>& s1, const myset<T>& s2, myset<T>& result)
{
    result.reserve(s1.size() / 4); // cheating to compete with a better memory allocator
    for (auto& v : s1)
    {
        if (s2.find(v) != s2.end())
            result.insert(v);
    }
}

int main()
{
    myset<int64_t> s1;
    myset<int64_t> s2;
    myset<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000 * 1000 * 100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000 * 1000 * 100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;
}

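With this code I got run times of 0.28 seconds for both the C++ and the Python version.

Now, if we want to beat Python's set performance, we can remove all the cheats and use Google's dense_hash_set, which implements open addressing with quadratic probing, as a drop-in replacement (you only need to call set_empty_key(0)).

With google::dense_hash_set and a no-op hash function, we get: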

fill s1 took 0.321397 seconds
fill s2 took 0.529518 seconds
s1 length = 7692308, s2 length = 14285714
intersect s1 and s2 took 0.0974416 seconds
s3 length = 1098901

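That is 2.8 times faster than Python, while preserving the hash set functionality!

P.S. One might wonder why the C++ standard library implements such a slow hash table. The no-free-lunch theorem applies here too: a probing-based solution is not always fast. Being an opportunistic solution, it sometimes suffers from "clumping" (repeatedly probing occupied slots). When that happens, performance degrades exponentially. The idea behind the standard library implementation was to guarantee predictable performance for all possible inputs. Unfortunately, as Chandler Carruth explains in his talk, the caching effects of modern hardware are too great to be neglected.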
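The "clumping" effect can be shown with a small experiment. The sketch below counts how many slots a linear-probing table has to inspect while inserting a batch of distinct keys; count_probes is a hypothetical helper written for this illustration, with an identity hash and a power-of-two capacity assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the slot inspections needed to insert distinct `keys` into an
// initially empty linear-probing table (identity hash, power-of-two size).
// Hypothetical helper for illustration only.
std::size_t count_probes(const std::vector<std::uint64_t>& keys, std::size_t capacity)
{
    assert(capacity > keys.size());            // always leave a free slot
    assert((capacity & (capacity - 1)) == 0);  // capacity is a power of two

    std::vector<bool> used(capacity, false);
    std::size_t probes = 0;
    for (std::uint64_t k : keys) {
        std::size_t i = static_cast<std::size_t>(k) & (capacity - 1);
        ++probes;                              // inspect the home slot
        while (used[i]) {
            i = (i + 1) & (capacity - 1);      // occupied: try the next slot
            ++probes;
        }
        used[i] = true;
    }
    return probes;
}
```

With a capacity of 16, the keys 0, 1, 2, 3 each land in their own slot and need 4 inspections in total, while 0, 16, 32, 48 all map to slot 0 and need 1 + 2 + 3 + 4 = 10. A chained table would have the same collisions, but an open addressing table additionally slows down unrelated keys whose home slots lie inside the clump.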

Answer 1 (score: 3)

In this benchmark, sorted vectors vastly outperform sets:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <string>   // std::string, used by elapsed()
#include <cstdint>  // int64_t

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::vector<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace_back(i);
    }
    std::sort(s.begin(), s.end());
}

template <typename T>
void intersect(const std::vector<T>& s1, const std::vector<T>& s2, std::vector<T>& result)
{
    std::set_intersection(s1.begin(), s1.end(),
                            s2.begin(), s2.end(),
                            std::inserter(result, result.begin()));
}

int main()
{
    std::vector<int64_t> s1;
    std::vector<int64_t> s2;
    std::vector<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

For me (clang/libc++, -O3), this takes the results from:

fill s1 took 2.01944 seconds
fill s2 took 3.98959 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.55453 seconds
s3 length = 1098901

to:

fill s1 took 0.143026 seconds
fill s2 took 0.20209 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.0548819 seconds
s3 length = 1098901

The reason for this performance difference is that the vector version performs far fewer allocations.
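The allocation count can be pushed down further by reserving the output up front. The following is a sketch only: intersect_reserved is an illustrative name, and reserving min(a.size(), b.size()) is an assumption that trades memory for a single allocation, since the intersection can never be larger than the smaller input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Intersects two sorted vectors with at most one allocation for the result.
// `intersect_reserved` is a hypothetical helper, not part of the answer above.
std::vector<std::int64_t> intersect_reserved(const std::vector<std::int64_t>& a,
                                             const std::vector<std::int64_t>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<std::int64_t> out;
    out.reserve(std::min(a.size(), b.size()));  // upper bound on the result size
    std::set_intersection(a.begin(), a.end(),
                          b.begin(), b.end(),
                          std::back_inserter(out));  // appends, never reallocates here
    return out;
}
```

By comparison, a std::set result allocates one tree node per inserted element.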

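Answer 2 (score: 2)

You are not comparing like with like.

Python sets are unordered (hash) sets. std::set<> is an ordered set (binary tree).

From the Python documentation:

5.4. Sets: Python also includes a data type for sets. A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

Refactored to compare like with like: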

#include <iostream>
#include <unordered_set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <tuple>
#include <string>   // std::string, used by elapsed()
#include <cstdint>  // int64_t

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::unordered_set<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const std::unordered_set<T>& s1, const std::unordered_set<T>& s2, std::unordered_set<T>& result)
{
    auto ordered_refs = [&]()
    {
        if (s1.size() <= s2.size())
            return std::tie(s1, s2);
        else
            return std::tie(s2, s1);
    };

    auto lr = ordered_refs();
    auto& l = std::get<0>(lr);
    auto& r = std::get<1>(lr);
    result.reserve(l.size());

    for (auto& v : l)
    {
        if (auto i = r.find(v) ; i != r.end())
            result.insert(v);
    }
}

int main()
{
    std::unordered_set<int64_t> s1;
    std::unordered_set<int64_t> s2;
    std::unordered_set<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

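Performance will depend on your toolset.

I suspect you can improve performance significantly with a custom allocator. The default one is thread-safe, and so on.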
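As one possible illustration of the custom-allocator idea, the sketch below uses the C++17 std::pmr facilities: both sets draw their nodes from a std::pmr::monotonic_buffer_resource, which is not thread-safe and releases all memory at once when it is destroyed. The function name intersection_size_pmr is hypothetical, and whether this actually helps on a given toolchain is an assumption that has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <unordered_set>

// Builds two arithmetic-progression sets inside one arena and returns the
// size of their intersection. Hypothetical helper; requires C++17.
std::size_t intersection_size_pmr(std::int64_t start1, std::int64_t step1,
                                  std::int64_t start2, std::int64_t step2,
                                  std::int64_t end)
{
    assert(step1 > 0 && step2 > 0);

    // Declared first so it outlives both sets that allocate from it.
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::unordered_set<std::int64_t> s1(&arena);
    std::pmr::unordered_set<std::int64_t> s2(&arena);

    for (std::int64_t i = start1; i < end; i += step1) s1.insert(i);
    for (std::int64_t i = start2; i < end; i += step2) s2.insert(i);

    // Iterate over the smaller set and probe the larger one.
    const auto& smaller = s1.size() <= s2.size() ? s1 : s2;
    const auto& larger  = s1.size() <= s2.size() ? s2 : s1;

    std::size_t common = 0;
    for (std::int64_t v : smaller) {
        if (larger.find(v) != larger.end()) ++common;
    }
    return common;
}
```

A monotonic arena never reuses freed memory, so it suits build-once, query-many workloads like this benchmark rather than sets with frequent erasures.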

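Having said that, on my machine I only see a 20% speed increase with the unordered version. I would hazard a guess that the Python intersection code has been hand-optimised.

For reference, the Python source code is here: https://github.com/python/cpython/blob/master/Objects/setobject.c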