Consider the following code:
#include <thread>
#include <iostream>
#include <vector>
#include <future>
#include <chrono>
#include <cmath>

long long int partialSum(const std::vector<int>& v, int begin, int end) {
    long long int sum = 0;
    for (int i = begin; i < end; i++) {
        sum += (v[i]);
    }
    return sum;
}

int main() {
    std::vector<int> v(10000000, 1);
    //2 threads
    auto start = std::chrono::high_resolution_clock::now();
    std::future<long long int> f1 = std::async(std::launch::async, partialSum, v, 0, 10000000 / 2);
    std::future<long long int> f2 = std::async(std::launch::async, partialSum, v, 10000000 / 2, 10000000);
    volatile long long int a = f1.get() + f2.get();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 2 threads-> " << duration.count() << std::endl;
    //1 thread
    start = std::chrono::high_resolution_clock::now();
    volatile long long int b = partialSum(v, 0, 10000000);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 1 thread-> " << duration.count() << std::endl;
}
and the output on my machine (VS2019):
With 2 threads-> 35477
With 1 thread-> 7000
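Note that I had to add volatile to keep the compiler from optimizing even further.
Also note that I know std::accumulate exists, but I am currently learning multithreading and this is a POC.
Basically, I want to know what kind of optimization the compiler is doing here, because the single-threaded version ends up so much faster than the threaded one.
When I change partialSum and replace the operation (with log10, say), the threaded version is twice as fast as the regular one.
Edit: After some suggestions, I changed the code to the following: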
#include <thread>
#include <iostream>
#include <vector>
#include <future>
#include <functional> // std::cref
#include <chrono>
#include <cmath>

long long int partialSum(const std::vector<int>& v, int begin, int end) {
    long long int sum = 0;
    for (int i = begin; i < end; i++) {
        sum += (v[i]);
    }
    return sum;
}

int main() {
    std::vector<int> v(10000000, 1);
    //2 threads
    auto start = std::chrono::high_resolution_clock::now();
    std::future<long long int> f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000 / 2);
    std::future<long long int> f2 = std::async(std::launch::async, partialSum, std::cref(v), 10000000 / 2, 10000000);
    volatile long long int a = f1.get() + f2.get();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << "With 2 threads-> " << duration.count() << std::endl;
    //1 thread
    start = std::chrono::high_resolution_clock::now();
    f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "With 1 thread-> " << duration.count() << std::endl;
}
and the output:
With 2 threads-> 11835
With 1 thread-> 0
Answer 0 (score: 4)
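The performance gap exists because you are actually copying the vector here. std::async stores its own copy of every argument, so each call below copies all 10,000,000 elements before partialSum even starts: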
std::future<long long int> f1 = std::async(std::launch::async, partialSum, v, 0, 10000000 /2);
std::future<long long int> f2 = std::async(std::launch::async, partialSum, v, 10000000 / 2, 10000000);
Use std::cref to pass the vector by const reference instead:
std::future<long long int> f1 = std::async(std::launch::async, partialSum, std::cref(v), 0, 10000000 /2);
std::future<long long int> f2 = std::async(std::launch::async, partialSum, std::cref(v), 10000000 / 2, 10000000);
Then measure the performance again. For me, the 2-thread version is faster after this change. Try it here: Godbolt link
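Your second snippet prints 0 for the 1-thread case because you are not waiting for f1 to finish: the timer only measures how long it takes to launch the task.
Add the following before you take the end timestamp: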
volatile long long int b = f1.get();
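As for the loop optimization (this part may not be necessary for the OP), the compiler (GCC) is vectorizing the loop, even without any -march= option, using baseline SSE2 instructions. The generated asm is: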
.L399:
movdqu xmm0, XMMWORD PTR [rax]
movdqa xmm2, xmm4
add rax, 16
pcmpgtd xmm2, xmm0
movdqa xmm3, xmm0
punpckldq xmm3, xmm2
punpckhdq xmm0, xmm2
paddq xmm1, xmm3
paddq xmm1, xmm0
cmp rdx, rax
jne .L399
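If we accumulate into an int instead of a long long int, the compiler no longer has to widen each 32-bit element to 64 bits before adding, which makes the loop easier to optimize (at the cost of an accumulator that can overflow for larger sums). The asm output then reduces to: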
pxor xmm0, xmm0
.L13:
movdqu xmm2, XMMWORD PTR [rdx]
add rdx, 16
paddd xmm0, xmm2
cmp rcx, rdx
jne .L13