Question

There is no single optimal -O level. To find the fastest build of my particular code, I compile the same code with the usual optimization settings (i.e. -O0, -O1, -O2, -O3, -Ofast, -march=native) and check which flags give the fastest execution (measured with time).

So, is there a way to test all of the optimization levels listed above by running a Makefile once per -O level?

I think GNU Parallel could run the Makefile while changing the -O level, but I can't figure out how.

Thanks in advance.

Answer 1

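Do you want to use GNU Parallel to run multiple builds in parallel? You'd need separate build directories at the very least, and a more complicated build setup if you want to avoid copying the whole source directory. If you try to run multiple separate builds in the same directory at the same time, some object files will be built with one set of CFLAGS and others with a different set.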
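As a rough sketch of the separate-directory idea, assuming the whole project can simply be copied and that its Makefile has a clean target and honors CFLAGS: the names flags_to_dir, build_one and BUILD_ROOT are made up for illustration and don't come from any existing tool.

```shell
#!/usr/bin/env bash
# Sketch only: flags_to_dir, build_one and BUILD_ROOT are hypothetical names.

# Map a flag set such as "-O3 -march=native" to "build-O3-march=native".
flags_to_dir() {
    printf 'build%s\n' "$(printf '%s' "$1" | tr -d ' ')"
}

# Copy the source tree into its own directory and build it there, so object
# files built with different CFLAGS never end up in the same directory.
# usage: build_one SRC_DIR "FLAGS"   (BUILD_ROOT must live outside SRC_DIR)
build_one() {
    local src=$1 flags=$2 dir
    dir="${BUILD_ROOT:-/tmp/oflag-builds}/$(flags_to_dir "$flags")"
    mkdir -p "$dir" &&
    cp -R "$src"/. "$dir"/ &&
    make -C "$dir" clean &&      # drop any object files that were copied over
    make -C "$dir" CFLAGS="$flags"
}
```

In bash, after export -f flags_to_dir build_one, something like parallel build_one . ::: -O0 -O1 -O2 -O3 -Ofast should start one build per flag set, each in its own directory.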

Using @Etan's loop suggestion:

NJOBS=$(getconf _NPROCESSORS_ONLN)  # adjust as desired
for flag in -O{0..3} -O{3,fast}" -march=native"; do
    make clean
    make -j"$NJOBS" CFLAGS+="$flag -fprofile-generate"
    ./a.out  # feed it some input that exercises different options and code paths
    make clean  # must leave the *.gcda profile data in place
    make -j"$NJOBS" CFLAGS+="$flag -fprofile-use"
    # perf stat prints its counters on stderr, so merge that into the pipe for tee
    perf stat ./a.out 2>&1 | tee "perfstat$flag.txt"
done

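Note that make -j is what provides the parallelism here, not GNU Parallel. Also note the use of profile-guided optimization (PGO). x264 has a build system with a make fprofiled target for building a PGO executable, which takes care of the build / run / rebuild cycle. So it's possible, but IDK if it makes the Makefile messy.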

You could use GNU Parallel for the timed runs of your code, but you'll get more consistent results if you run the timings on an otherwise idle machine.

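If you want to test how your code behaves when multiple copies are running at once, competing for cache space and memory bandwidth (or even for execution resources with hyperthreading), then test it with multiple copies of the same code, not with some runs competing with gcc, some running the -O0 build, and some running the -O3 build.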
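A minimal sketch of that kind of test, assuming the binary is ./a.out as in the loop above; run_copies is a made-up helper name, not an existing command.

```shell
#!/usr/bin/env bash
# Sketch only: run_copies is a hypothetical helper.
# usage: run_copies N command [args...]
# Starts N copies of the same command in the background, then waits for all
# of them, so every copy competes only with identical copies of itself.
run_copies() {
    local n=$1 i=0
    shift
    while [ "$i" -lt "$n" ]; do
        "$@" &
        i=$((i + 1))
    done
    wait
}
```

In bash, time run_copies 4 ./a.out would then time the whole batch of four identical copies.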

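As far as optimization options go, you'll typically get the best results from gcc with the -fprofile-generate and -fprofile-use options. Clang can also do profile-guided optimization, either with the same options or using data from CPU performance counters. (The manual describes using a tool to convert Linux perf record data into something Clang can use.)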

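Some gcc optimizations are only enabled by -fprofile-use (or manually, but not by -O3 alone). For example, -funroll-loops can help in some tight loops. Don't use it for everything, though, because the larger code size may cause more I-cache misses across the whole program, which can outweigh the gain from reduced loop overhead in a few hot loops.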

GNU Parallel running Makefiles with different optimization levels
