Question

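I am trying to benchmark the addition of two 2D arrays in C++ on my 2.4 GHz Intel Core 2 Duo CPU. I sum the arrays over and over, so the problem becomes z = x + y + y + y + ..., where z, x, and y are all 2D arrays. To get a broad set of measurements, I loop over both the number of times y is added and the size of the arrays. Below is the log produced by running the code on my CPU.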

Array Size: 500
Iterations: 2
n1: 500
n2: 501
count: 750
Time: 5.00391
75.0913 MegaFLOPS

Iterations: 4
n1: 500
n2: 501
count: 589
Time: 5.00125
118.006 MegaFLOPS

Iterations: 8
n1: 500
n2: 501
count: 343
Time: 5.00967 
137.209 MegaFLOPS

Iterations: 16
n1: 500
n2: 501
count: 185
Time: 5.00164
148.247 MegaFLOPS

Iterations: 32
n1: 500
n2: 501
count: 92
Time: 5.03487
146.473 MegaFLOPS

Iterations: 64
n1: 500
n2: 501
count: 48
Time: 5.01763
153.366 MegaFLOPS

Iterations: 128
n1: 500
n2: 501
count: 25
Time: 5.02799
159.428 MegaFLOPS

Iterations: 256
n1: 500
n2: 501
count: 13
Time: 5.16209 
161.497 MegaFLOPS

Iterations: 512
n1: 500
n2: 501   
count: 7
Time: 5.65551
158.747 MegaFLOPS

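Each benchmark runs for 5 seconds (Time), my first array size is 500x501, and count is the number of times the full sum completed within the 5-second window.

The computed FLOPS figures seem very low to me. Below is the code I use for the benchmark. In my actual program, this loop sits inside another loop that iterates over the array sizes (n1 and n2) and the number of iterations (iters).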

Stopwatch sw;
int maxTime = 5;
int count = 0;
sw.restart();
while (sw.getTime() < maxTime) {
    for (int x = 0; x < n1; x++) {
        for (int y = 0; y < n2; y++) {
            array3[x][y] = array2[x][y] + array1[x][y];
            for (int k = 0; k < iters; k++) {
                array3[x][y] += array2[x][y];
            }
        }
    }
    count++;
}
sw.stop();


std::cout << "n1: " << n1 << std::endl;
std::cout << "n2: " << n2 << std::endl;
std::cout << "count: " << count << std::endl;
std::cout << "Time: " << sw.getTime() << std::endl;

float mflops = (float)(n1*n2*count*iters*1.0e-06/sw.getTime());
std::cout << mflops << " MegaFLOPS" << std::endl;
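As a sanity check, the reported rate can be recomputed from the logged values. The sketch below is a hypothetical helper (the name megaflops is not from the original code) that applies the same formula: it counts one floating-point operation per inner-loop addition and, like the original, ignores the initial array2 + array1 addition. It evaluates the product in double, which also avoids integer overflow for larger arrays or counts.

```cpp
#include <cassert>

// Hypothetical helper: MegaFLOPS for `count` passes over an n1 x n2 array,
// with `iters` additions per element per pass, completed in `seconds`.
double megaflops(int n1, int n2, int count, int iters, double seconds) {
    assert(seconds > 0.0);
    // Start from a double so the whole product is evaluated in floating point.
    double ops = 1.0e-6 * n1 * n2 * count * iters;
    return ops / seconds;
}
```

For the first log entry, megaflops(500, 501, 750, 2, 5.00391) comes to about 75.09, which agrees with the logged 75.0913 MegaFLOPS.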

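Using Java I can reach almost one GigaFLOPS, so I am puzzled as to why my C++ program is so slow.

Any help is greatly appreciated.

Edit:

Here is the code I use to create the performance counter (the "Stopwatch"):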

Stopwatch::Stopwatch(){
    _running=false;
    _start=0;
    _time=0;
}

void Stopwatch::start() {
    if (!_running) {
        gettimeofday(&begtime, NULL);
        _running = true;
        _start = begtime.tv_sec + begtime.tv_usec / 1.0e6;
    }
}

void Stopwatch::stop() {
    if (_running) {
        gettimeofday(&endtime, NULL);
        _time += endtime.tv_sec + endtime.tv_usec / 1.0e6 - _start;
        _running = false;
    }
}

void Stopwatch::reset() {
    stop();
    _time = 0;
}

void Stopwatch::restart() {
    reset();
    start();
}

double Stopwatch::getTime() {
    if (_running) {
        gettimeofday(&nowtime, NULL);
        return nowtime.tv_sec + nowtime.tv_usec / 1.0e6 - _start;
    }
    return _time;
}

Answer 1

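I just ran this on my Core 2 Duo under 64-bit Ubuntu. The MFLOPS you measured look like those of an unoptimized build (I got 133 MFLOPS). Compiling with -O3 produced 1600 teraflops, because the results were never used and the compiler optimized the work away. Including one result value in the print statement gave 530 to 630 MFLOPS. However, this PC needs the maximum CPU MHz selected in its power-saving options, and with that set it produced a steady 789 MFLOPS. A 32-bit build would give different numbers.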
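A minimal sketch of one way to keep the result in use, assuming flat std::vector storage rather than the original 2D arrays. The names add_repeated and checksum are illustrative, not part of the original program. Returning a checksum that the caller prints makes the sums observable, so the optimizer cannot discard the loops as dead code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: z = x + y + y + ... with `iters` extra additions of y.
// Returns the sum of all elements of z so the caller can print it.
double add_repeated(const std::vector<double>& x,
                    const std::vector<double>& y,
                    std::vector<double>& z,
                    int iters) {
    assert(x.size() == y.size() && y.size() == z.size());
    double checksum = 0.0;
    for (std::size_t i = 0; i < z.size(); i++) {
        double acc = x[i] + y[i];
        for (int k = 0; k < iters; k++) {
            acc += y[i];
        }
        z[i] = acc;
        checksum += acc;  // keeps every element's result live
    }
    return checksum;
}
```

Printing the returned value after the timed region (for example next to the MegaFLOPS line) is enough; the value itself can be ignored.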

Answer 2

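I took the liberty of rewriting your code a bit, hopefully with a better idea of what you are trying to accomplish. Mostly, I set the code up to run a fixed number of iterations: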

for (int i = 0; i < 10000; i++) {
    for (int x = 0; x < n1; x++){
        for (int y = 0; y < n2; y++){
            array3[x][y] = array2[x][y] + array1[x][y];
            for (int k = 0; k < iters; k++)
                array3[x][y] += array2[x][y];           
        }
    }
    ++count;
}

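This may not look like an improvement at first, but I wanted to use OpenMP to run the code in parallel, and it can only parallelize counted loops. To enable that, I added this line just before the loop above: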

#pragma omp parallel for reduction(+:count)

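Then I added -openmp when compiling the code, and voilà, the code suddenly ran in parallel on all available cores. On my ancient desktop (2.6 GHz Athlon 64 X2), the reported speed went up to 1400 megaFLOPS (compared with 1060 megaFLOPS without OpenMP).

On my laptop (Intel i7-3630QM), it reaches around 9000 megaFLOPS (but it is thermally limited, so the speed depends on how many iterations it runs; run it too long and it throttles back to around 7800 megaFLOPS). Even running on a single core, it still manages over 2800 megaFLOPS.

FWIW, here is the complete source of the version I tested: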
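To illustrate what the reduction clause does, here is a small sketch in which count_passes is a hypothetical function, not part of the program being discussed. Each thread increments its own private copy of count, and the copies are added together when the loop finishes. With GCC or Clang the flag that enables OpenMP is -fopenmp; if OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>

// Hypothetical example: count completed passes with an OpenMP reduction.
int count_passes(int passes) {
    assert(passes >= 0);
    int count = 0;
    // Each thread gets a private `count`; the partial counts are summed
    // into the shared variable at the end of the loop.
#pragma omp parallel for reduction(+:count)
    for (int i = 0; i < passes; i++) {
        // ... one full pass over the arrays would go here ...
        ++count;
    }
    return count;
}
```

The loop has to be a counted loop with a known trip count, which is why the time-based while loop from the question was replaced by a fixed number of passes.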

#include <time.h>
#include <iostream>
#include <stdlib.h>

class Stopwatch {
    clock_t start_;
public:
    Stopwatch() : start_(clock()) {}
    double stop() { return double(clock()-start_) / CLOCKS_PER_SEC; }
};

int main() {
    static const int n1 = 500;
    static const int n2 = 501;
    static double array1[n1][n2], array2[n1][n2], array3[n1][n2];

    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            array1[i][j] = 1.0 / rand();
            array2[i][j] = 1.0 / rand();
        }
    }

    int iters = 7;

    int count = 0;
    Stopwatch sw;

#pragma omp parallel for reduction(+:count)
    for (int i = 0; i < 10000; i++) {
        for (int x = 0; x < n1; x++){
            for (int y = 0; y < n2; y++){
                array3[x][y] = array2[x][y] + array1[x][y];
                for (int k = 0; k < iters; k++)
                    array3[x][y] += array2[x][y];           
            }
        }
        ++count;
    }
    double t = sw.stop();

    std::cout << "ignore:";
    for (int i = 0; i < 10; i++)
        std::cout << array3[rand() % n1][rand() % n2] << "\t";
    std::cout << "\nQuit ignoring\n";

    std::cout << "n1: " << n1 << std::endl;
    std::cout << "n2: " << n2 << std::endl;
    std::cout << "count: " << count << std::endl;
    std::cout << "iters: " << iters << std::endl;
    std::cout << "Time: " << t << std::endl;


    double ops = 1.0e-6 * n1 * n2 * count * iters;
    double mflops = ops / t;
    std::cout << mflops << " MegaFLOPS" << std::endl;
}
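One caveat when timing a parallel run: on POSIX systems clock() reports processor time for the whole process, which adds up across threads, so it can understate a parallel speedup. A minimal sketch of a wall-clock alternative built on std::chrono::steady_clock follows. WallStopwatch is a hypothetical name, and this is a suggested substitute under that assumption, not the code that produced the numbers above.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical wall-clock stopwatch with the same interface as the
// Stopwatch class above: it starts on construction, and stop() returns
// the elapsed time in seconds.
class WallStopwatch {
    std::chrono::steady_clock::time_point start_;
public:
    WallStopwatch() : start_(std::chrono::steady_clock::now()) {}

    double stop() const {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        double seconds = elapsed.count();
        assert(seconds >= 0.0);  // steady_clock is monotonic
        return seconds;
    }
};
```

It can be dropped in by declaring WallStopwatch sw; in place of Stopwatch sw; the rest of main stays the same.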

Low FLOPS measurement for adding two 2D arrays
