Question

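The performance differences between C++ vectors and plain arrays have been discussed extensively, for example here and here. These discussions usually conclude that vectors and arrays perform similarly when elements are accessed with the [] operator and the compiler is allowed to inline functions. That is what I expected, but I have run into a case where it does not seem to hold. The lines below do something quite simple: they take a 3D volume, sweep through it, and apply a small 3D mask a certain number of times. Depending on the VERSION macro, the volume is declared as a vector and accessed through at (VERSION=2), declared as a vector and accessed through [] (VERSION=1), or declared as a plain array (VERSION=0).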

```cpp
#include <vector>

#define NX 100
#define NY 100
#define NZ 100
#define H 1
#define C0 1.5f
#define C1 0.25f
#define T 3000

#if !defined(VERSION) || VERSION > 2 || VERSION < 0
  #error "Bad version"
#endif

#if VERSION == 2
  #define AT(_a_,_b_) (_a_.at(_b_))
  typedef std::vector<float> Field;
#endif

#if VERSION == 1
  #define AT(_a_,_b_) (_a_[_b_])
  typedef std::vector<float> Field;
#endif

#if VERSION == 0
  #define AT(_a_,_b_) (_a_[_b_])
  typedef float* Field;
#endif

#include <iostream>
#include <omp.h>

int main(void) {

#if VERSION != 0
  Field img(NX*NY*NY);
#else
  Field img = new float[NX*NY*NY];
#endif

  double end, begin;
  begin = omp_get_wtime();

  const int csize = NZ;
  const int psize = NZ * NX;
  for(int t = 0; t < T; t++ ) {

    /* Swap the 3D volume and apply the "blurring" coefficients */
    #pragma omp parallel for
    for(int j = H; j < NY-H; j++ ) {
      for( int i = H; i < NX-H; i++ ) {
        for( int k = H; k < NZ-H; k++ ) {
          int eindex = k+i*NZ+j*NX*NZ;
          AT(img,eindex) = C0 * AT(img,eindex) +
              C1 * (AT(img,eindex - csize) +
                    AT(img,eindex + csize) +
                    AT(img,eindex - psize) +
                    AT(img,eindex + psize) );
        }
      }
    }
  }

  end = omp_get_wtime();
  std::cout << "Elapsed "<< (end-begin) <<" s." << std::endl;

  /* Access img field so we force it to be deleted after accounting time */
  #define WHATEVER 12.f
  if( img[ NZ ] == WHATEVER ) {
    std::cout << "Whatever" << std::endl;
  }

#if VERSION == 0
  delete[] img;
#endif
}
```

One would expect the code to perform the same with VERSION=1 and VERSION=0, but the output is as follows:

VERSION 2 : Elapsed 6.94905 s.
VERSION 1 : Elapsed 4.08626 s.
VERSION 0 : Elapsed 1.97576 s.

If I compile without OMP (I only have two cores), I get similar results:

VERSION 2 : Elapsed 10.9895 s.
VERSION 1 : Elapsed 7.14674 s.
VERSION 0 : Elapsed 3.25336 s.

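I always compile with GCC 4.6.3 and the same compilation options each time (dropping the OpenMP flag, of course, when compiling without OMP). Is there something I am doing wrong, for example when compiling? Or should we really expect this difference between vectors and arrays?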

PS: I cannot use std::array, because the compilers I depend on do not support the C++11 standard. With ICC 13.1.2 I get similar behavior.

Answer 1

I tried your code, using chrono to measure the time.

I compiled with clang (version 3.5) and libc++.

clang++ test.cc -std=c++1y -stdlib=libc++ -lc++abi -finline-functions -O3

For VERSION 0 and VERSION 1 the results are essentially identical: both average about 3.4 seconds. (I am using a virtual machine, so it is slower.)

Then I tried g++ (version 4.8.1):

g++ test.cc -std=c++1y -finline-functions -O3

This gives roughly 4.4 seconds for VERSION 0 and roughly 5.2 seconds for VERSION 1.

Then I tried clang++ with libstdc++:

clang++ test.cc -std=c++11 -finline-functions -O3

Voilà, the result is back to 3.4 seconds.

So this is purely an optimization "bug" in g++.
