Question

我正在尝试使用以下代码进行比较串行和并行的表现（非lambda和lambda）。

#include<iostream>
#include<chrono>
#include <ctime>
#include<fstream>
#include<stdlib.h>
#define MAX 10000000
#include "tbb/tbb.h"
#include "tbb/task_scheduler_init.h"

using namespace std;
using namespace tbb;
void squarecalc(int a)
{
    a *= a;
}
void serial_apply_square(int* a)
{
    for (int i = 0; i<MAX; i++)
        squarecalc(*(a + i));
}

class apply_square
{
    int* my_a;
public:
    void operator()(const blocked_range<size_t>& r) const
    {
        int *a = my_a;
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    apply_square(int* a) :my_a(a){}
};
void parallel_apply_square(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), apply_square(a));
}
void parallel_apply_square_lambda(int* a, size_t n)
{
    parallel_for(blocked_range<size_t>(0, n), [=](const blocked_range<size_t>& r)
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
            squarecalc(a[i]);
    }
    );
}

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;
    int i = 0;
    int* a = new int[MAX];

    fstream of;
    of.open("newfile", ios::in);
    while (i<MAX)
    {
        of >> a[i];
        i++;
    }

    start = std::chrono::system_clock::now();
    serial_apply_square(a);
    end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "\nTime for serial execution  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [without lambda]  :" << elapsed_seconds.count() << endl;

    start = std::chrono::system_clock::now();
    parallel_apply_square_lambda(a, MAX);
    end = std::chrono::system_clock::now();

    elapsed_seconds = end - start;
    cout << "\nTime for parallel execution [with lambda] :" << elapsed_seconds.count() << endl;
    free(a);
}

简而言之，它只是以串行和并行方式计算10000000个数字的平方。下面是我多次执行的输出目标代码。

**1st execution**

Time for serial execution  :0.043183

Time for parallel execution [without lambda]  :0.035238

Time for parallel execution [with lambda]  :0.036719

**2nd execution**

Time for serial execution  :0.043252

Time for parallel execution [without lambda]  :0.035403

Time for parallel execution [with lambda]  :0.036811

**3rd execution**

Time for serial execution  :0.043241

Time for parallel execution [without lambda]  :0.035355

Time for parallel execution [with lambda]  :0.036558

**4th execution**

Time for serial execution  :0.043216

Time for parallel execution [without lambda]  :0.035491

Time for parallel execution [with lambda]  :0.036697

认为并行执行时间小于串行执行对于所有情况来说，我很好奇为什么lambda方法时间更长而身体对象是自编写的其他并行版本。

为什么lambda版本总是花费更多时间？
是因为编译器创建自己的主体的开销对象
如果上述问题的答案是肯定的，那就是lambda版本不如自编的版本？

修改

以下是优化代码（级别-O2）的结果

**1st execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.00055

Time for parallel execution [with lambda]  :1e-05

**2nd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000583

Time for parallel execution [with lambda]  :9e-06

**3rd execution**

Time for serial execution  :0

Time for parallel execution [without lambda]  :0.000554

Time for parallel execution [with lambda]  :9e-06

现在，优化的代码似乎为串行部件显示了更好的结果而lamba部分时间有所改善。

这是否意味着始终需要使用并行代码性能进行测试优化代码？

Answer 1

这是否意味着并行代码性能始终需要使用优化代码进行测试？

任何代码性能都必须使用优化代码进行测试。您是否希望在调试期间或实际使用软件时优化代码以实现快速运行时？

您的代码中的主要问题是您的循环没有做任何工作（squarecalc，甚至很可能serial_apply_square(int* a)完全无法优化）并且测量的时间太短而无法充当不同结构的实际表现指标。

TBB lambda vs自我写的身体对象

1 个答案: