This program runs 10 times slower on a 2.5GHz Xeon than on a 2.3GHz Core i5

Time: 2019-03-20 15:12:35

Tags: c performance optimization cpu scientific-computing

I am running some Monte Carlo simulations with the following code.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>

#define L 20
#define N 200000


int main() {
    struct timespec ts_start, ts_end;
    clock_t clk_start, clk_end;
    struct rusage usage;
    struct timeval tv_ustart, tv_uend;
    struct timeval tv_sstart, tv_send;

    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    clk_start = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_ustart = usage.ru_utime;
    tv_sstart = usage.ru_stime;

    /* 
     * the runtime of one simulation may vary a lot, so let's repeat it many times.
     */
    for (int iteration = 0; iteration < N; iteration += 1) {
        int node_num_by_depth[L + 1] = {1};

        for (int level = 0; level < L; level += 1) {
            for (int depth = level; depth != -1; depth -= 1) {
                int parent_num = node_num_by_depth[depth];
                node_num_by_depth[depth] = 0;
                for (int parent = 0; parent < parent_num; parent += 1) {
                    int child_num = 1 + arc4random() % 2;
                    int child_depth = depth + (child_num > 1);
                    node_num_by_depth[child_depth] += child_num;
                }
            }
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_end);
    clk_end = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_uend = usage.ru_utime;
    tv_send = usage.ru_stime;  /* system CPU time is reported in ru_stime */


    double elapsed = (double) (ts_end.tv_sec - ts_start.tv_sec)
                     + (double) (ts_end.tv_nsec - ts_start.tv_nsec) / 1E9;
    printf("Wall clock time elapsed: %g\n", elapsed);

    double cpu_time_used = ((double) (clk_end - clk_start)) / CLOCKS_PER_SEC;
    printf("CPU time elapsed: %g\n", cpu_time_used);

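    /* Convert each timeval difference to seconds as a double: this handles a
     * negative microsecond difference and avoids format/type mismatches. */
    double user_time = (double) (tv_uend.tv_sec - tv_ustart.tv_sec)
                       + (double) (tv_uend.tv_usec - tv_ustart.tv_usec) / 1E6;
    printf("User CPU time elapsed: %.6f\n", user_time);

    double system_time = (double) (tv_send.tv_sec - tv_sstart.tv_sec)
                         + (double) (tv_send.tv_usec - tv_sstart.tv_usec) / 1E6;
    printf("System CPU time elapsed: %.6f\n", system_time);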
}

On my MacBook Pro (2.3GHz dual-core Intel Core i5, Turbo Boost up to 3.6GHz, with 64MB of eDRAM), the output is

Wall clock time elapsed: 32.408
CPU time elapsed: 32.3566
User CPU time elapsed: 32.319151
System CPU time elapsed: 32.319114

On a FreeBSD server with CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (2499.96-MHz K8-class CPU), it gives me

Wall clock time elapsed: 396.563
CPU time elapsed: 396.414
User CPU time elapsed: 158.377633
System CPU time elapsed: 158.376629

In both cases I compiled with clang main.c -Ofast -march=native -o main

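The server originally ran Ubuntu 18.04, but after realizing that my program ran slowly there, I switched to FreeBSD. However, even on FreeBSD (where arc4random is provided natively), it is still noticeably slower. I would rather not switch operating systems again unless there is a very good reason.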

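I have profiled the program with gprof; you can find the results here. We can see that arc4random takes a lot of time (30.8%), but that alone does not explain the tenfold slowdown.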


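As pointed out in the comments, I have added some more time measurements. You can see a huge gap between wall clock time and user CPU time on the server (396.563 s versus 158.377633 s). Can I narrow it? In any case, what I care about is how long the program takes to run, not how many CPU cycles it uses.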
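One way to see where the time goes is to read user and system CPU time separately from getrusage and compare their sum with the wall clock time. This is a sketch under stated assumptions: `timeval_diff_seconds`, `struct cpu_times`, and `read_cpu_times` are hypothetical helpers, not part of the program above.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical helper: end - start in seconds. Working in double means a
 * negative microsecond difference is absorbed by the seconds term. */
static double timeval_diff_seconds(struct timeval start, struct timeval end) {
    return (double) (end.tv_sec - start.tv_sec)
           + (double) (end.tv_usec - start.tv_usec) / 1e6;
}

/* Hypothetical snapshot of this process's user and system CPU time. */
struct cpu_times {
    struct timeval user;
    struct timeval sys;
};

/* Returns 0 on success, -1 if getrusage fails. */
static int read_cpu_times(struct cpu_times *out) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    out->user = usage.ru_utime;  /* time spent executing in user mode */
    out->sys = usage.ru_stime;   /* time spent in the kernel on our behalf */
    return 0;
}
```

Taking one snapshot before the loop and one after, then passing the matching fields to `timeval_diff_seconds`, gives the two components. If user plus system time roughly equals wall clock time, the process was not waiting, and a large system share would point at kernel work rather than at the simulation itself.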


Here is some more detailed information about the MacBook Pro

Hardware Overview:

  Model Name: MacBook Pro
  Model Identifier: MacBookPro14,1
  Processor Name: Intel Core i5
  Processor Speed: 2.3 GHz
  Number of Processors: 1
  Total Number of Cores: 2
  L2 Cache (per Core): 256 KB
  L3 Cache: 4 MB
  Memory: 8 GB
  Boot ROM Version: 184.0.0.0.0
  SMC Version (system): 2.43f6
  Serial Number (system): XXXXXXXXXXXX
  Hardware UUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

My server has twelve Intel Skylake Xeon Platinum 8163 2.5GHz CPUs, with the following specifications

[Image: server CPU specifications]


I have made a big discovery! After replacing arc4random with random, performance on the server improves dramatically.

  • MacBook Pro with random

    Wall clock time elapsed: 5.61316
    CPU time elapsed: 5.40623
    User CPU time elapsed: 5.361931
    System CPU time elapsed: 5.359096

  • Server with random

    Wall clock time elapsed: 6.69183
    CPU time elapsed: 6.6875
    User CPU time elapsed: 6.690042
    System CPU time elapsed: 6.689085
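Since each simulation step needs only a single random bit (`% 2`), another option would be to keep calling an expensive generator but use all 32 bits of every result. The sketch below is an assumption-laden illustration, not a tested fix: `struct bit_pool` and `next_bit` are hypothetical names, and `xorshift32` is a non-cryptographic stand-in so the example builds without arc4random.

```c
#include <stdint.h>

/* Hypothetical bit reservoir: hands out one random bit per call and asks the
 * underlying generator for a fresh 32-bit word only once every 32 calls. */
struct bit_pool {
    uint32_t (*refill)(void);  /* source of 32 random bits */
    uint32_t bits;             /* bits not yet handed out */
    int remaining;             /* how many unused bits are left in `bits` */
};

static int next_bit(struct bit_pool *pool) {
    if (pool->remaining == 0) {
        pool->bits = pool->refill();
        pool->remaining = 32;
    }
    int bit = (int) (pool->bits & 1u);
    pool->bits >>= 1;
    pool->remaining -= 1;
    return bit;
}

/* Stand-in generator for this sketch only (xorshift32, not cryptographic). */
static uint32_t xorshift32_state = 2463534242u;

static uint32_t xorshift32(void) {
    uint32_t x = xorshift32_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    xorshift32_state = x;
    return x;
}
```

With a pool initialized as `{xorshift32, 0, 0}` (or with a wrapper around `arc4random()` as the refill function), `1 + arc4random() % 2` in the inner loop could become `1 + next_bit(&pool)`, which reduces generator calls by a factor of 32. Note that `random()` returns only 31 random bits, so it would need a pool width of 31 instead.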
    

0 answers