Question

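I have been asked to vectorize a larger program. Before starting on the large program, I wanted to see the effect of vectorization in an isolated case. To do that I wrote two programs that should demonstrate the idea of the transformation: one using an array of structs (AoS, without vectorization) and one using a struct of arrays (SoA, with vectorization). I expected SoA to outperform AoS by far, but it does not.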

Measured loop of program A

for (int i = 0; i < NUM; i++) {
    ptr[i].c = ptr[i].a + ptr[i].b;
}

Full program:

#include <cstdlib>
#include <iostream>
#include <stdlib.h>

#include <chrono>
using namespace std;
using namespace std::chrono;


struct myStruct {
    double a, b, c;
};
#define NUM 100000000

high_resolution_clock::time_point t1, t2, t3;

int main(int argc, char* argsv[]) {
    struct myStruct *ptr = (struct myStruct *) malloc(NUM * sizeof(struct myStruct));

    for (int i = 0; i < NUM; i++) {
        ptr[i].a = i;
        ptr[i].b = 2 * i;
    }
    t1 = high_resolution_clock::now();
    for (int i = 0; i < NUM; i++) {
        ptr[i].c = ptr[i].a + ptr[i].b;
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "took "<<dur << endl;
    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += ptr[i].c;
    }
    cout << "sum is "<< sum << endl;

}

Measured loop of program B

#pragma simd 
for (int i = 0; i < NUM; i++) {
    C[i] = A[i] + B[i];
}

Full program:

#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <chrono>

using namespace std;
using namespace std::chrono;

#define NUM 100000000

high_resolution_clock::time_point t1, t2, t3;

int main(int argc, char* argsv[]) {
    double *A = (double *) malloc(NUM * sizeof(double));
    double *B = (double *) malloc(NUM * sizeof(double));
    double *C = (double *) malloc(NUM * sizeof(double));
    for (int i = 0; i < NUM; i++) {
        A[i] = i;
        B[i] = 2 * i;
    }


    t1 = high_resolution_clock::now();
    #pragma simd
    for (int i = 0; i < NUM; i++) {
        C[i] = A[i] + B[i];
    }
    t2 = high_resolution_clock::now();
    long dur = duration_cast<microseconds>( t2 - t1 ).count();
    cout << "SoA "<<dur << endl;

    double sum = 0;
    for (int i = 0; i < NUM; i++) {
        sum += C[i];
    }
    cout << "sum "<<sum;
}

I compile with

icpc vectorization_aos.cpp -qopenmp --std=c++11 -cxxlib=/lrz/mnt/sys.x86_64/compilers/gcc/4.9.3/

Compiled with icpc (v16) and executed on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz.

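In my test runs, program A takes about 300 ms and program B about 350 ms. If I add unnecessary extra data to the struct in A, it gets progressively slower (because more memory has to be loaded). The -O3 flag has no effect on the runtime, and removing the #pragma simd directive has no effect either. So either the loop is being auto-vectorized anyway, or my vectorization is not working at all.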

Questions:

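Am I missing something? Is this how a program should be vectorized?
Why is the second program (B) slower? Maybe the program is simply memory-bandwidth bound and I need to increase the compute density?
Is there a better program/code snippet that shows the effect more clearly, and how can I verify that my program is actually executed vectorized?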

Vectorized program has a longer runtime

0 answers: