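I am trying to compare the performance of serial and parallel execution
(with a hand-written body object and with a lambda) using the following code.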
#include <iostream>
#include <chrono>
#include <ctime>
#include <fstream>
#include <stdlib.h>
#define MAX 10000000
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"
using namespace std;
using namespace tbb;
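// Takes its argument by value, so the squared result is discarded.
void squarecalc(int a)
{
    a *= a;
}
// Serial version: calls squarecalc on every element.
void serial_apply_square(int* a)
{
    for (int i = 0; i < MAX; i++)
        squarecalc(*(a + i));
}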
// Hand-written body object for parallel_for.
class apply_square
{
    int* my_a;
public:
    void operator()(const blocked_range<size_t>& r) const
    {
        int* a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    apply_square(int* a) : my_a(a) {}
};
void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), [=](const blocked_range<size_t>& r)
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    );
}
int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    int i = 0;
    int* a = new int[MAX];
    // Read the input numbers from a file.
    fstream of;
    of.open("newfile", ios::in);
    while (i < MAX)
    {
        of >> a[i];
        i++;
    }
    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution :" << elapsed_seconds.count() << endl;
    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda] :" << elapsed_seconds.count() << endl;
    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();
    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;
    delete[] a; // memory from new[] must be released with delete[], not free()
}
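In short, it just computes the squares of 10000000 numbers, both serially and in parallel. Below is the output I got from multiple executions of the object code.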
**1st execution**
Time for serial execution :0.043183
Time for parallel execution [without lambda] :0.035238
Time for parallel execution [with lambda] :0.036719
**2nd execution**
Time for serial execution :0.043252
Time for parallel execution [without lambda] :0.035403
Time for parallel execution [with lambda] :0.036811
**3rd execution**
Time for serial execution :0.043241
Time for parallel execution [without lambda] :0.035355
Time for parallel execution [with lambda] :0.036558
**4th execution**
Time for serial execution :0.043216
Time for parallel execution [without lambda] :0.035491
Time for parallel execution [with lambda] :0.036697
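Although the parallel execution time is lower than the serial execution time
in all cases, I am curious why the lambda version takes longer than
the other parallel version, where the body object is hand-written.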
EDIT
Following are the results for the optimized code (level -O2):
**1st execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.00055
Time for parallel execution [with lambda] :1e-05
**2nd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000583
Time for parallel execution [with lambda] :9e-06
**3rd execution**
Time for serial execution :0
Time for parallel execution [without lambda] :0.000554
Time for parallel execution [with lambda] :9e-06
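Now the optimized code seems to show better results for the serial part, and the time for the lambda part has improved.
Does this mean that parallel code performance always needs to be tested with optimized code?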
Answer 0 (score: 0)
> Does this mean that parallel code performance always needs to be tested with optimized code?
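The performance of any code needs to be tested with optimized code. Do you want your code to run fast while you are debugging it, or when the software is actually in use?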
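The main problem in your code is that your loops do no work (`squarecalc`, and quite possibly even `serial_apply_square(int* a)`, can be optimized away entirely), and the measured times are too short to serve as a meaningful indicator of how the different constructs perform.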
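As a rough illustration of the points above, here is a minimal sketch of a measurement helper in plain standard C++ (no TBB). It assumes you are willing to change the benchmark so that the loop body has an observable effect. The names `square_in_place`, `SquareBody`, `checksum`, `Measurement` and `measure` are hypothetical and not part of your code or of TBB. The element is squared through a reference so the result is kept, the squared data is summed and returned so the work is actually used, and the best time of several repetitions is reported with `std::chrono::steady_clock`. Returning the checksum makes it less likely that the optimizer removes the loop, but it is not a hard guarantee.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <numeric>
#include <vector>

// Squares the element in place; taking a reference means the result is kept.
inline void square_in_place(std::uint64_t& x) { x *= x; }

// Hand-written body object (hypothetical), playing the role of apply_square.
struct SquareBody {
    void operator()(std::vector<std::uint64_t>& v) const {
        for (auto& x : v) square_in_place(x);
    }
};

// Sums the data so that the squared values are observably used.
inline std::uint64_t checksum(const std::vector<std::uint64_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
}

struct Measurement {
    double best_seconds;  // fastest of the timed repetitions
    std::uint64_t sum;    // checksum of the data after the last repetition
};

// Runs `body` on a fresh copy of `input` `repeats` times and keeps the best time.
template <typename Body>
Measurement measure(const std::vector<std::uint64_t>& input, Body body, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    Measurement m{std::numeric_limits<double>::infinity(), 0};
    for (int r = 0; r < repeats; ++r) {
        std::vector<std::uint64_t> data = input;  // the copy is not timed
        const auto start = clock::now();
        body(data);
        const auto stop = clock::now();
        m.sum = checksum(data);
        const double s = std::chrono::duration<double>(stop - start).count();
        if (s < m.best_seconds) m.best_seconds = s;
    }
    return m;
}
```

Calling `measure(input, SquareBody{}, 10)` and then `measure` again with a lambda that runs the same loop lets you compare the two on equal terms, and printing `sum` in both cases confirms that they computed the same thing. A `parallel_for` call could be wrapped in a body in the same way. With an input large enough that each run takes clearly longer than the clock resolution, differences of a millisecond or so between the functor and the lambda are most likely noise rather than a property of lambdas.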