Question

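In another thread I started a discussion about vectors and arrays, in which I was mostly playing devil's advocate to push buttons. Along the way, though, I stumbled onto a test case that has me a little perplexed. I would like to have a real discussion about it, but given the "abuse" I am getting for playing devil's advocate, a real discussion in that thread is no longer possible. Still, this particular example intrigues me, and I cannot explain it to myself satisfactorily.

The discussion is about the general performance of vectors versus arrays, ignoring dynamic behavior. For example, repeatedly calling push_back() on a vector will obviously slow it down. We assume that both the vector and the array come pre-populated with data. The example I presented, which was subsequently modified by someone in the thread, is the following: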

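#include <cstddef>      // std::size_t
#include <cstdlib>      // srand, rand
#include <iostream>
#include <new>          // ::operator new, placement new
#include <type_traits>
#include <utility>      // std::forward
#include <vector>
using namespace std;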

const int ARRAY_SIZE = 500000000;

// http://stackoverflow.com/a/15975738/500104
template <class T>
class no_init_allocator
{
public:
    typedef T value_type;

    no_init_allocator() noexcept {}
    template <class U>
        no_init_allocator(const no_init_allocator<U>&) noexcept {}
    T* allocate(std::size_t n)
        {return static_cast<T*>(::operator new(n * sizeof(T)));}
    void deallocate(T* p, std::size_t) noexcept
        {::operator delete(static_cast<void*>(p));}
    template <class U>
        void construct(U*) noexcept
        {
            // libstdc++ doesn't know 'is_trivially_default_constructible', still has the old names
            static_assert(is_trivially_default_constructible<U>::value,
            "This allocator can only be used with trivially default constructible types");
        }
    template <class U, class A0, class... Args>
        void construct(U* up, A0&& a0, Args&&... args) noexcept
        {
            ::new(up) U(std::forward<A0>(a0), std::forward<Args>(args)...);
        }
};

int main() {
    srand(5);  //I use the same seed, we just need the random distribution.
    vector<char, no_init_allocator<char>> charArray(ARRAY_SIZE);
    //char* charArray = new char[ARRAY_SIZE];
    for(int i = 0; i < ARRAY_SIZE; i++) {
        charArray[i] = (char)((i%26) + 48) ;
    }

    for(int i = 0; i < ARRAY_SIZE; i++) {
        charArray[i] = charArray[rand() % ARRAY_SIZE];
    }
}

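When I run this on my machine, I get the following terminal output. I used the highest level of optimization to give the vector the best chance of success. Below are my results: the first two runs have the array line uncommented, and the last two have the vector line uncommented.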

//Array run # 1
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m20.287s
user    0m20.068s
sys     0m0.175s

//Array run # 2
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m21.504s
user    0m21.267s
sys     0m0.192s

//Vector run # 1
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m28.513s
user    0m28.292s
sys     0m0.178s

//Vector run # 2
clang++ -std=c++11 -stdlib=libc++ -o3 some.cpp -o b.out && time ./b.out

real    0m28.607s
user    0m28.391s
sys     0m0.178s

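The array outperforming the vector does not surprise me. However, a difference on the order of 50% surprises me a great deal. I would expect the difference to be negligible, and I suspect that the nature of this test case is obscuring the real cause of the results. When you run this test with smaller array sizes, the performance difference shrinks dramatically.

My explanation:

The additional implementation instructions for the vector cause the vector's instructions to be poorly aligned in memory, perhaps even, in this example, split at a really bad point across two different "blocks". This makes memory jump back and forth between cache levels, the data cache, and the instruction cache more often than you would expect. I also suspect that the LLVM compiler may be exaggerating the weaknesses and optimizing poorly because of some of the newer C++11 elements, though I have no basis for either explanation beyond hypothesis and conjecture.

I am interested in (A) whether someone can replicate my results, and (B) whether someone has a better explanation of how the computer is running this particular benchmark and why the vector underperforms the array so dramatically in this instance.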
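To make the size dependence easier to check, here is a minimal sketch of an in-process comparison that times both containers in the same binary for a given size. It is not the original benchmark: the names `run_workload`, `seconds_taken`, `Timing`, and `compare_containers` are hypothetical, a small linear congruential generator stands in for `rand()` so both runs see the same index sequence, and a plain `std::vector<char>` is used instead of the `no_init_allocator` variant (allocation happens outside the timed region).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: fills data[0..n) and then performs n pseudo-random
// reads, mirroring the two loops in the question. Returns a checksum so the
// optimizer cannot discard the work.
template <class Container>
unsigned long long run_workload(Container& data, std::size_t n) {
    assert(n > 0);
    for (std::size_t i = 0; i < n; ++i)
        data[i] = static_cast<char>((i % 26) + 48);

    std::uint32_t state = 5;  // fixed seed; LCG used here instead of rand()
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        state = state * 1664525u + 1013904223u;
        data[i] = data[state % n];
        sum += static_cast<unsigned char>(data[i]);
    }
    return sum;
}

// Runs f once and returns the elapsed wall-clock time in seconds.
template <class F>
double seconds_taken(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

struct Timing {
    double array_s;
    double vector_s;
    bool checksums_match;
};

// Times the same workload on a heap array and on a std::vector of size n.
Timing compare_containers(std::size_t n) {
    std::unique_ptr<char[]> raw(new char[n]);
    std::vector<char> vec(n);
    unsigned long long a = 0, v = 0;
    Timing t{};
    t.array_s = seconds_taken([&] { a = run_workload(raw, n); });
    t.vector_s = seconds_taken([&] { v = run_workload(vec, n); });
    t.checksums_match = (a == v);
    return t;
}
```

Calling `compare_containers` with a range of sizes (for example 1000000 up to the question's 500000000) and printing `array_s` and `vector_s` would show whether the gap depends on size. The checksum comparison only guards against the two runs doing different work.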

My setup: http://www.newegg.com/Product/Product.aspx?Item=N82E16834100226

Answer 1

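A simpler explanation: you are building with optimizations disabled. You want -O3, not -o3.

I don't have clang available to reproduce your test exactly, but my results are as follows: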

//Array run # 1
$ g++ -std=c++11 -O3 test.cpp -o b.out && time ./b.out

real    0m25.323s
user    0m25.162s
sys     0m0.148s

//Vector run #1
$ g++ -std=c++11 -O3 test.cpp -o b.out && time ./b.out

real    0m25.634s
user    0m25.486s
sys     0m0.136s
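Since -O3 and -o3 differ only in letter case, a benchmark can report for itself whether it was built with optimization. The following is a minimal sketch, assuming the `__OPTIMIZE__` predefined macro, which GCC and Clang define when optimizing; other compilers may not define it. The function names `built_with_optimization` and `optimization_banner` are hypothetical.

```cpp
#include <string>

// Reports whether this translation unit was compiled with optimization on.
// Assumption: relies on the __OPTIMIZE__ predefined macro, which GCC and
// Clang define at -O1 and above; other compilers may not define it.
inline bool built_with_optimization() {
#ifdef __OPTIMIZE__
    return true;
#else
    return false;
#endif
}

// Builds a one-line banner suitable for printing at benchmark start-up.
inline std::string optimization_banner() {
    return built_with_optimization()
        ? "optimization: enabled"
        : "optimization: DISABLED (check for -O3 vs -o3)";
}
```

Printing `optimization_banner()` at the top of `main` would have flagged the lowercase flag immediately, before any timing numbers were compared.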

Answer 2

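I can guarantee that LLVM does in fact misoptimize std::vector (if you are actually optimizing at all), at least as of right now. It does not correctly inline many of the function calls involved. You will get better performance with GCC.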
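If inlining of the vector's member functions is the concern, one way to take it out of the picture is to fetch a raw pointer once with `data()` and index through that, so the hot loop contains no calls to `operator[]`. This is only a sketch of that workaround, not something taken from this answer; the name `fill_via_pointer` is hypothetical, and whether it changes the timings on a given compiler would need to be measured.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: same fill pattern as the question's first loop, but
// written through a raw pointer obtained once from data(), so the loop body
// does not depend on vector::operator[] being inlined.
template <class Alloc>
void fill_via_pointer(std::vector<char, Alloc>& v) {
    char* p = v.data();
    const std::size_t n = v.size();
    for (std::size_t i = 0; i < n; ++i)
        p[i] = static_cast<char>((i % 26) + 48);
}
```

Because the function is templated on the allocator, it accepts both a plain `std::vector<char>` and the question's `vector<char, no_init_allocator<char>>`.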
