Which loop optimization is the compiler performing here?

Date: 2020-07-28 20:37:17

Tags: c++ multithreading for-loop optimization

Consider the following code:

#include <thread>
#include <iostream>
#include <vector>
#include <future>
#include <chrono>
#include <cmath>

long long int partialSum(const std::vector<int>& v, int begin, int end) {
    long long int sum = 0;
    for (int i = begin; i < end; i++) {
        sum += (v[i]);
    }
    return sum;
}

int main() {
    std::vector<int> v(10000000, 1);

    //2 threads
    auto start = std::chrono::high_resolution_clock::now();
    std::future<long long int> f1 = std::async(std::launch::async, partialSum, v, 0, 10000000 / 2);
    std::future<long long int> f2 = std::async(std::launch::async, partialSum, v, 10000000 / 2, 10000000);
    volatile long long int a = f1.get() + f2.get();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 2 threads-> " << duration.count() << std::endl;

    //1 thread
    start = std::chrono::high_resolution_clock::now();
    volatile long long int b = partialSum(v, 0, 10000000);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 1 thread-> " << duration.count() << std::endl;
}

And the output on my machine (VS2019):

With 2 threads-> 35477
With 1 thread-> 7000

Note that I had to add volatile to keep the compiler from optimizing the computation away. Also note that I am aware of std::accumulate, but I am currently learning multithreading and this is a POC. Basically, I want to know what kind of optimization the compiler applies here, because the single-threaded version performs remarkably well compared to the threaded one. When I change partialSum to use a more expensive operation (log10, say), the threaded version is twice as fast as the regular one.
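For context, the heavier variant mentioned above might look something like this (a sketch; the question does not show the exact code, so the body is an assumption):

#include <cmath>
#include <vector>

// Hypothetical heavier-work variant: the per-element log10 makes the loop
// compute-bound, so splitting it across two threads pays off.
double partialSumLog(const std::vector<int>& v, int begin, int end) {
    double sum = 0.0;
    for (int i = begin; i < end; i++) {
        sum += std::log10(static_cast<double>(v[i]));
    }
    return sum;
}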

EDIT: Following some suggestions, I changed the code to the following:

#include <thread>
#include <iostream>
#include <vector>
#include <future>
#include <chrono>
#include <cmath>

long long int partialSum(const std::vector<int>& v, int begin, int end) {
    long long int sum = 0;
    for (int i = begin; i < end; i++) {
        sum += (v[i]);
    }
    return sum;
}

int main() {
    std::vector<int> v(10000000, 1);

    //2 threads
    auto start = std::chrono::high_resolution_clock::now();
    std::future<long long int> f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000 / 2);
    std::future<long long int> f2 = std::async(std::launch::async, partialSum, std::cref(v), 10000000 / 2, 10000000);
    volatile long long int a = f1.get() + f2.get();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 2 threads-> " << duration.count() << std::endl;

    //1 thread
    start = std::chrono::high_resolution_clock::now();
    f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "With 1 thread-> " << duration.count() << std::endl;
}

With this output:

With 2 threads-> 11835
With 1 thread-> 0

1 answer:

Answer 0 (score: 4)

The performance gap exists because you are actually copying the vector here:

std::future<long long int> f1 = std::async(std::launch::async, partialSum, v, 0, 10000000 /2);
std::future<long long int> f2 = std::async(std::launch::async, partialSum, v, 10000000 / 2, 10000000);

Pass it by const reference using std::cref:

std::future<long long int> f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000 /2);
std::future<long long int> f2 = std::async(std::launch::async, partialSum, std::cref(v), 10000000 / 2, 10000000);

Then try measuring the performance again. For me, after this change the 2-thread version is faster. Try it here: Godbolt link
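Some background the answer leaves implicit (a fact about std::async, not part of the original answer): std::async, like std::thread, decay-copies its arguments into the shared state, so passing v by name copies all ten million ints for each call; std::cref instead passes a std::reference_wrapper, which is trivially cheap to copy. A minimal sketch demonstrating the difference with a copy-counting type (all names here are illustrative):

#include <atomic>
#include <functional>
#include <future>
#include <iostream>

std::atomic<int> copies{0};

struct Payload {
    Payload() = default;
    Payload(const Payload&) { ++copies; } // count every copy
};

int use(const Payload&) { return 0; }

int main() {
    Payload p;
    std::async(std::launch::async, use, p).get();            // decay-copies p
    std::cout << "by value: " << copies << "\n";             // typically 1
    copies = 0;
    std::async(std::launch::async, use, std::cref(p)).get(); // copies only the wrapper
    std::cout << "by cref:  " << copies << "\n";             // 0
}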

Your second snippet is printing 0 for 1 thread because you are not waiting for f1 to finish.

Put the following before taking the end value:

volatile long long int b = f1.get();
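With that fix applied, the single-thread section of the edited snippet would read:

//1 thread
start = std::chrono::high_resolution_clock::now();
f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000);
volatile long long int b = f1.get(); // blocks until the sum is actually computed
end = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "With 1 thread-> " << duration.count() << std::endl;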

As for the loop optimization (this part is probably beside the point for the OP): the compiler (GCC) is vectorizing the loop (without any -march= option). The generated asm is as follows:

.L399:
        movdqu  xmm0, XMMWORD PTR [rax]
        movdqa  xmm2, xmm4
        add     rax, 16
        pcmpgtd xmm2, xmm0
        movdqa  xmm3, xmm0
        punpckldq       xmm3, xmm2
        punpckhdq       xmm0, xmm2
        paddq   xmm1, xmm3
        paddq   xmm1, xmm0
        cmp     rdx, rax
        jne     .L399
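In C++ terms, that loop corresponds roughly to the following SSE2 intrinsics (a hand-written sketch rather than anything from the answer; it assumes the range length is a multiple of 4 and omits the scalar remainder handling a real compiler would emit):

#include <emmintrin.h> // SSE2
#include <vector>

long long int partialSumSse2(const std::vector<int>& v, int begin, int end) {
    __m128i acc  = _mm_setzero_si128(); // two 64-bit partial sums
    __m128i zero = _mm_setzero_si128();
    for (int i = begin; i < end; i += 4) {
        __m128i x    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&v[i])); // movdqu
        __m128i sign = _mm_cmpgt_epi32(zero, x);     // pcmpgtd: sign mask of each int
        __m128i lo   = _mm_unpacklo_epi32(x, sign);  // punpckldq: widen low 2 ints to 64 bits
        __m128i hi   = _mm_unpackhi_epi32(x, sign);  // punpckhdq: widen high 2 ints
        acc = _mm_add_epi64(acc, lo);                // paddq
        acc = _mm_add_epi64(acc, hi);                // paddq
    }
    alignas(16) long long int lanes[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1];
}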

If we use int instead of long long int, the compiler has an easier time optimizing further: the 32-bit elements no longer need to be sign-extended to 64 bits, so the sign-mask and unpack instructions disappear and the loop body becomes a single packed 32-bit add (paddd). The asm output then shrinks to:

        pxor    xmm0, xmm0
.L13:
        movdqu  xmm2, XMMWORD PTR [rdx]
        add     rdx, 16
        paddd   xmm0, xmm2
        cmp     rcx, rdx
        jne     .L13
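For reference, the int variant the answer describes would look like this (note that it changes the accumulator width; with ten million elements of value 1 an int sum does not overflow, but for other data it easily could):

#include <vector>

int partialSumInt(const std::vector<int>& v, int begin, int end) {
    int sum = 0; // 32-bit accumulator: no widening needed, the loop vectorizes to plain paddd
    for (int i = begin; i < end; i++) {
        sum += v[i];
    }
    return sum;
}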