Question

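Let me preface this by saying that C++ is not my typical area of work; I am more often in C# and Matlab. I also don't pretend to be able to read x86 assembly code. Having recently seen some videos on "modern C++" and the new instructions on the latest processors, though, I figured I'd poke around a little and see what I could learn. I do have some existing C++ DLLs that would benefit from speed improvements; those DLLs use many trig and power operations from <cmath>.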

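So I whipped up a simple benchmark program in VS2013 Express for Desktop. The processor on my machine is an Intel i7-4800MQ (Haswell). The program is pretty simple: it fills some std::vector<double> with 5 million random entries each, and then loops over them doing a math operation that combines the values. I measure the time spent using std::chrono::high_resolution_clock::now() immediately before and after the loop: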

[Edit: Including full program code]

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<float> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        double result_value; 

        result_value = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);

        y1.push_back(result_value);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

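I put VS in the Release configuration with the standard options (e.g. /O2). I do one build with /arch:IA32 and run it a few times, and another with /arch:AVX and run it a few times. Consistently, the AVX build is about 3.6x slower than the IA32 build. In this specific example, 773 ms compared to 216 ms.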
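The "run it a few times" step can also be folded into the program. The following is a minimal sketch, not part of the original benchmark: `best_of_ms` is a hypothetical helper name, and it assumes the work being timed can safely be repeated (as with a loop that writes `y1[i]` in place).

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper (not in the original program): runs `work` `repeats`
// times and returns the fastest wall-clock time in milliseconds.
// If `repeats` is zero or negative, the work never runs and the
// largest double value is returned.
template <typename Work>
double best_of_ms(Work work, int repeats)
{
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r)
    {
        auto start = std::chrono::high_resolution_clock::now();
        work();
        auto stop = std::chrono::high_resolution_clock::now();

        // Convert the elapsed time to fractional milliseconds
        std::chrono::duration<double, std::milli> elapsed = stop - start;
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

Passing the math loop as a lambda, for example `best_of_ms([&] { /* math loop */ }, 5)`, reports the fastest of five runs, which tends to be less noisy than a single measurement.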

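As a sanity check I did try some other very basic operations: combinations of multiplies and adds, raising a number to the 8th power, and so on. With those, AVX is at least as fast as IA32, if not a bit faster. So why might my code above be impacted so much? Or where might I look to find out?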
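The exact sanity-check code is not shown in the post. A control kernel of the kind described might look like the sketch below; `pow8_control` is a hypothetical name, and the particular mix of operations is an assumption. The point is only that it contains no <cmath> calls, so any slowdown under /arch:AVX would have to come from the compiler's own generated code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical control kernel: only multiplies and adds, no <cmath> calls.
// Assumes x2 has at least as many elements as x1.
void pow8_control(const std::vector<double>& x1,
                  const std::vector<double>& x2,
                  std::vector<double>& y)
{
    y.resize(x1.size());
    for (std::size_t i = 0; i < x1.size(); ++i)
    {
        double p2 = x1[i] * x1[i];   // x^2
        double p4 = p2 * p2;         // x^4
        double p8 = p4 * p4;         // x^8
        y[i] = p8 + x1[i] * x2[i];   // add one more multiply to the mix
    }
}
```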

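Edit 2: At the suggestion of someone on Reddit, I changed the code to something more vectorizable. This makes both SSE2 and AVX run faster, but AVX is still much slower than SSE2: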

#include "stdafx.h"
#include <chrono>
#include <random>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int _tmain(int argc, _TCHAR* argv[])
{

    // Set up random number generator
    std::tr1::mt19937 eng;
    std::tr1::normal_distribution<double> dist;

    // Number of calculations to do
    uint32_t n_points = 5000000;

    // Input vectors
    std::vector<double> x1;
    std::vector<double> x2;
    std::vector<double> x3;

    // Output vectors
    std::vector<double> y1;

    // Initialize
    x1.reserve(n_points);
    x2.reserve(n_points);
    x3.reserve(n_points);
    y1.reserve(n_points);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1.push_back(dist(eng));
        x2.push_back(dist(eng));
        x3.push_back(dist(eng));
        y1.push_back(0.0);
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop
    for (size_t i = 0; i < n_points; i++)
    {
        y1[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    return 0;
}

IA32: 209 ms
SSE: 205 ms
SSE2: 75 ms
AVX: 371 ms

As for the specific version of Visual Studio, this is 2013 Express for Desktop Update 1 (Version 12.0.30110.00 Update 1).

Answer 1

When the CPU switches between using AVX and SSE instructions, it needs to save/restore the upper halves of the ymm registers, and that can incur a pretty large penalty.

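Normally, compiling with /arch:AVX would fix this for your own code, since it will use AVX128 instructions instead of SSE ones wherever possible. In this case, however, it may be that your standard library's math functions are not implemented with AVX instructions, in which case you would pay a transition penalty on every function call. You'd have to post a disassembly to be sure.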

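You often see VZEROUPPER called before a transition, to signal that the CPU doesn't need to save the upper halves of the registers, but the compiler isn't smart enough to know whether a function it calls needs that too.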
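To make the VZEROUPPER idea concrete, here is a rough sketch of issuing it by hand before calling into library code that might use legacy SSE encodings. `_mm256_zeroupper()` is the intrinsic for VZEROUPPER from <immintrin.h>; `clear_upper_ymm` and `combine` are hypothetical names introduced only for illustration. Whether this actually removes the slowdown in the benchmark above is an assumption that would need to be checked against the disassembly and measured.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
#ifdef __AVX__
#include <immintrin.h>
#endif

// Hypothetical helper: when the build has AVX enabled (/arch:AVX on MSVC,
// -mavx on GCC/Clang, both of which define __AVX__), emit VZEROUPPER so the
// upper halves of the ymm registers are marked clean. Otherwise a no-op.
inline void clear_upper_ymm()
{
#ifdef __AVX__
    _mm256_zeroupper();
#endif
}

// Same expression as the benchmark loop, with the upper ymm halves cleared
// before each pair of library calls. Assumes x2 and x3 are at least as
// long as x1.
void combine(const std::vector<double>& x1,
             const std::vector<double>& x2,
             const std::vector<double>& x3,
             std::vector<double>& y)
{
    y.resize(x1.size());
    for (std::size_t i = 0; i < x1.size(); ++i)
    {
        clear_upper_ymm();
        y[i] = std::sin(x1[i]) * x2[i] * std::atan(x3[i]);
    }
}
```

Calling the helper on every iteration is deliberately conservative. If the surrounding loop does not itself use 256-bit instructions, a single call before the loop may be enough.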

Answer 2

Following @LưuVĩnhPhúc, I investigated a bit. You can get this to vectorise very nicely, but not with std::vector or std::valarray. When I used std::unique_ptr I also had to alias the raw pointers, otherwise that blocked vectorisation too.

#include <chrono>
#include <random>
#include <math.h>
#include <iostream>
#include <string>
#include <valarray>
#include <functional>
#include <memory>

#pragma intrinsic(sin, atan)

int wmain(int argc, wchar_t* argv[])
{
    // Set up random number generator
    std::random_device rd;
    std::mt19937 eng(rd());
    std::normal_distribution<double> dist;

    // Number of calculations to do
    const uint32_t n_points = 5000000;

    // Input vectors
    std::unique_ptr<double[]> x1 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x2 = std::make_unique<double[]>(n_points);
    std::unique_ptr<double[]> x3 = std::make_unique<double[]>(n_points);

    // Output vectors
    std::unique_ptr<double[]> y1 = std::make_unique<double[]>(n_points);

    auto random = std::bind(dist, eng);

    // Fill inputs
    for (size_t i = 0; i < n_points; i++)
    {
        x1[i] = random();
        x2[i] = random();
        x3[i] = random();
        y1[i] = 0.0;
    }

    // Start timer
    auto start_time = std::chrono::high_resolution_clock::now();

    // Do math loop (raw pointer aliases so the loop can be vectorised)
    double * x_1 = x1.get(), *x_2 = x2.get(), *x_3 = x3.get(), *y_1 = y1.get();
    for (size_t i = 0; i < n_points; ++i)
    {
        y_1[i] = sin(x_1[i]) * x_2[i] * atan(x_3[i]);
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
    std::cout << "Duration: " << duration.count() << " ms";

    std::cin.ignore();
    return 0;
}

On my machine this takes 103 ms when compiled with [...], 252 ms with /arch:avx, and 98 ms with nothing set.

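Looking at the generated assembly, it seems the vector functions are implemented using SSE, so surrounding them with AVX instructions causes an impedance mismatch and slows things down. Hopefully MS will implement AVX versions in the future.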

AVX 3.6x slower than IA32 in a simple benchmark involving <cmath> operations - why would this be? (VS2013)

2 answers:
