I am using the following code to run some Monte Carlo simulations.
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>

#define L 20
#define N 200000

int main() {
    struct timespec ts_start, ts_end;
    clock_t clk_start, clk_end;
    struct rusage usage;
    struct timeval tv_ustart, tv_uend;
    struct timeval tv_sstart, tv_send;

    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    clk_start = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_ustart = usage.ru_utime;
    tv_sstart = usage.ru_stime;

    /*
     * The runtime of one simulation may vary a lot, so let's repeat it many times.
     */
    for (int iteration = 0; iteration < N; iteration += 1) {
        /* node_num_by_depth[d] is the number of nodes at depth d; start with one root. */
        int node_num_by_depth[L + 1] = {1};
        for (int level = 0; level < L; level += 1) {
            for (int depth = level; depth != -1; depth -= 1) {
                int parent_num = node_num_by_depth[depth];
                node_num_by_depth[depth] = 0;
                for (int parent = 0; parent < parent_num; parent += 1) {
                    /* Each parent spawns one or two children at random. */
                    int child_num = 1 + arc4random() % 2;
                    int child_depth = depth + (child_num > 1);
                    node_num_by_depth[child_depth] += child_num;
                }
            }
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_end);
    clk_end = clock();
    getrusage(RUSAGE_SELF, &usage);
    tv_uend = usage.ru_utime;
    tv_send = usage.ru_stime;

    double elapsed = (double) (ts_end.tv_sec - ts_start.tv_sec)
            + (double) (ts_end.tv_nsec - ts_start.tv_nsec) / 1E9;
    printf("Wall clock time elapsed: %g\n", elapsed);
    double cpu_time_used = ((double) (clk_end - clk_start)) / CLOCKS_PER_SEC;
    printf("CPU time elapsed: %g\n", cpu_time_used);
    /* timersub normalizes the difference, so tv_usec never comes out negative. */
    struct timeval tv_udiff, tv_sdiff;
    timersub(&tv_uend, &tv_ustart, &tv_udiff);
    printf("User CPU time elapsed: %ld.%06ld\n",
            (long) tv_udiff.tv_sec, (long) tv_udiff.tv_usec);
    timersub(&tv_send, &tv_sstart, &tv_sdiff);
    printf("System CPU time elapsed: %ld.%06ld\n",
            (long) tv_sdiff.tv_sec, (long) tv_sdiff.tv_usec);
}
When run on my MacBook Pro (2.3 GHz dual-core Intel Core i5, Turbo Boost up to 3.6 GHz, with 64 MB of eDRAM), the output is
Wall clock time elapsed: 32.408
CPU time elapsed: 32.3566
User CPU time elapsed: 32.319151
System CPU time elapsed: 32.319114
When run on a FreeBSD server (CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz (2499.96-MHz K8-class CPU)), it gives me
Wall clock time elapsed: 396.563
CPU time elapsed: 396.414
User CPU time elapsed: 158.377633
System CPU time elapsed: 158.376629
In both cases, the compilation command was clang main.c -Ofast -march=native -o main.
Originally, the server ran Ubuntu 18.04, but I switched to FreeBSD after realizing how slowly my program ran. However, even on FreeBSD (where arc4random is provided natively), it is still noticeably less efficient. I do not want to switch to yet another operating system now unless there is a very good reason.
I have profiled the program with gprof, and you can find the results here. We can see that arc4random takes a lot of time (30.8%), but it is not the only cause of the tenfold slowdown.
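To check how much of the runtime the RNG alone accounts for, a minimal microbenchmark along these lines could help (a sketch, not part of the original program; it assumes a platform where arc4random is declared in stdlib.h, such as macOS or FreeBSD, and the call count of 1e8 is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    struct timespec t0, t1;
    unsigned sink = 0;  /* fold the outputs together so the calls cannot be optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 100000000L; i += 1)
        sink ^= arc4random();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double seconds = (double) (t1.tv_sec - t0.tv_sec)
            + (double) (t1.tv_nsec - t0.tv_nsec) / 1E9;
    printf("1e8 arc4random() calls: %g s (sink %u)\n", seconds, sink);
}

Running the same loop on both machines, and again with another generator in place of arc4random(), would show whether the gap lives entirely inside the RNG.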
As suggested in the comments, I have added some other time measurements. You can see there is a huge gap between the wall-clock time and the CPU time. Can I narrow it? In any case, what I care about is how long it takes the program to run, not how many CPU cycles it consumes.
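One way to quantify that gap directly is to subtract user plus system CPU time from the wall-clock time; here is a sketch using the same APIs as the program above (the helper tv_to_sec is my own, not a library function):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Convert a struct timeval to seconds (helper for this sketch). */
static double tv_to_sec(struct timeval tv) {
    return (double) tv.tv_sec + (double) tv.tv_usec / 1E6;
}

int main() {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... the workload being measured goes here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    double wall = (double) (t1.tv_sec - t0.tv_sec)
            + (double) (t1.tv_nsec - t0.tv_nsec) / 1E9;
    double cpu = tv_to_sec(usage.ru_utime) + tv_to_sec(usage.ru_stime);
    /* For a single-threaded process, wall - cpu is time spent off-CPU:
     * blocked, descheduled, or waiting inside the kernel. */
    printf("wall %g s, cpu %g s, off-CPU %g s\n", wall, cpu, wall - cpu);
}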
Here is some extended information about the MacBook Pro:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro14,1
Processor Name: Intel Core i5
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB
Boot ROM Version: 184.0.0.0.0
SMC Version (system): 2.43f6
Serial Number (system): XXXXXXXXXXXX
Hardware UUID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
My server is equipped with twelve Intel Skylake Xeon Platinum 8163 2.5 GHz CPUs.
I have made a big discovery! After replacing arc4random with random, performance on the server improves dramatically.
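For reference, the change amounts to a one-line substitution in the inner loop, plus seeding once at startup, since random() is deterministic unless seeded while arc4random needs no seeding. A minimal sketch of the changed part (seeding with time(NULL) is my own choice, not part of the original program):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    /* random() must be seeded explicitly, unlike arc4random(). */
    srandom((unsigned) time(NULL));
    for (int i = 0; i < 5; i += 1) {
        int child_num = 1 + random() % 2;  /* replaces 1 + arc4random() % 2 */
        printf("%d\n", child_num);
    }
}

Note that random() is not cryptographically secure and has far less internal state than arc4random; for a Monte Carlo simulation that only needs uniform bits, that trade-off seems acceptable.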
With random, on the server:
Wall clock time elapsed: 5.61316
CPU time elapsed: 5.40623
User CPU time elapsed: 5.361931
System CPU time elapsed: 5.359096
With random, on my MacBook Pro:
Wall clock time elapsed: 6.69183
CPU time elapsed: 6.6875
User CPU time elapsed: 6.690042
System CPU time elapsed: 6.689085