Question

I couldn't find any clear benchmark on this topic, so I made one. I'm posting it here in case anyone else is looking for the same thing.

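I also have a question. Shouldn't SSE be 4 times faster than four FPU rsqrt calls in a loop? It is faster, but only by about 1.5 times. Does moving data into SSE registers have that much impact because I'm doing very little computation apart from the rsqrt? Or is it because the SSE rsqrt is more precise, and if so, how do I find out how many iterations the SSE rsqrt performs? The two results: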

4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
4 SSE align16 float[4]  RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07

Edit

Compiled with MSVC 11 using /GS- /Gy /fp:fast /arch:SSE2 /Ox /Oy- /GL /Oi and run on an AMD Athlon II X2 270.

Test code:

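#include <iostream>
#include <chrono>
#include <cmath>        // sqrt
#include <cstdlib>      // system
#include <xmmintrin.h>  // __m128, _mm_rsqrt_ss, _mm_rsqrt_ps, _mm_mul_ss, _mm_mul_ps
#include <th/thutility.h>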

int main(void)
{
    float i;
    //long i;
    float res;
    __declspec(align(16)) float var[4] = {0};

    auto t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
        res = sqrt(i);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float SQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, res);
         res *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 float RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << res << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 align16 float[4] RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " <<  var[0] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
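    {
         // Each lane uses rsqrt(i), so var[1] and var[2] hold
         // (i + 1) / sqrt(i) and (i + 2) / sqrt(i) rather than true square roots.
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
         thutility::math::rsqrt(i, var[1]);
         var[1] *= i + 1;
         thutility::math::rsqrt(i, var[2]);
         var[2] *= i + 2;
    }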
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "3 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
        << var[0] << " - " << var[1] << " - " << var[2] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
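    {
         // As above, every lane uses rsqrt(i); the packed SSE loop further
         // down takes rsqrt(i + k) per lane instead.
         thutility::math::rsqrt(i, var[0]);
         var[0] *= i;
         thutility::math::rsqrt(i, var[1]);
         var[1] *= i + 1;
         thutility::math::rsqrt(i, var[2]);
         var[2] *= i + 2;
         thutility::math::rsqrt(i, var[3]);
         var[3] *= i + 3;
    }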
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 align16 float[4] RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " 
        << var[0] << " - " << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ss(cache);
        cache = _mm_mul_ss(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "1 SSE align16 float[4]  RSQRT: " << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()
        << "us " << var[0] << std::endl;

    t1 = std::chrono::high_resolution_clock::now();
    for(i = 0; i < 5000000; i+=1)
    {
        var[0] = i;
        var[1] = i + 1;
        var[2] = i + 2;
        var[3] = i + 3;
        __m128& cache = reinterpret_cast<__m128&>(var);
        __m128 mmsqrt = _mm_rsqrt_ps(cache);
        cache = _mm_mul_ps(cache, mmsqrt);
    }
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "4 SSE align16 float[4]  RSQRT: "
        << std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count() << "us " << var[0] << " - "
        << var[1] << " - " << var[2] << " - " << var[3] << std::endl;

    system("PAUSE");
}

Results using the float type:

1 float SQRT: 24996us 2236.07
1 float RSQRT: 28003us 2236.07
1 align16 float[4] RSQRT: 32004us 2236.07
3 align16 float[4] RSQRT: 51013us 2236.07 - 2236.07 - 5e+006
4 align16 float[4] RSQRT: 87011us 2236.07 - 2236.07 - 2236.07 - 2236.07
1 SSE align16 float[4]  RSQRT: 46999us 2236.07
4 SSE align16 float[4]  RSQRT: 60008us 2236.07 - 2236.07 - 2236.07 - 2236.07

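My conclusion is that SSE2 isn't worth bothering with unless we compute on no fewer than 4 variables at once. (Maybe this only applies to rsqrt, but rsqrt is an expensive calculation that also includes several multiplications, so it probably applies to other calculations too.)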

Also, sqrt(x) is faster than x * rsqrt(x) with two iterations, and x * rsqrt(x) with one iteration is too inaccurate for distance calculation.

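So the statements I've seen on some boards that x * rsqrt(x) is faster than sqrt(x) are wrong. Unless you directly need 1/x^(1/2), using rsqrt instead of sqrt makes no sense and isn't worth the loss of precision.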

I also tried without the SSE2 flag, in case the compiler was applying SSE to the normal rsqrt loops; it gave the same results.

My RSQRT is a modified version of the Quake rsqrt (the algorithm is the same).

#include <cstdint>  // uint32_t

namespace thutility
{
    namespace math
    {
        void rsqrt(const float& number, float& res)
        {
              const float threehalfs = 1.5F;
              const float x2 = number * 0.5F;

              res = number;
              uint32_t& i = *reinterpret_cast<uint32_t *>(&res);    // evil floating point bit level hacking
              i  = 0x5f3759df - ( i >> 1 );                             // what the fuck?
              res = res * ( threehalfs - ( x2 * res * res ) );   // 1st iteration
              res = res * ( threehalfs - ( x2 * res * res ) );   // 2nd iteration, this can be removed
        }
    }
}

Answer 1

It's easy to pick up a lot of unnecessary overhead in SSE code.

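If you want to make sure your code is efficient, look at the compiler's disassembly. One thing that often kills performance (and it looks like it may be affecting you) is moving data unnecessarily between memory and SSE registers.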

Inside the loop, you should keep all relevant data, as well as the result, in SSE registers rather than in the float[4].

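Whenever you do access memory, verify that the compiler generates aligned move instructions to load the data into registers or to write it back to the array.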

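Also check that the generated SSE code doesn't contain a lot of unnecessary move instructions and other cruft between the instructions you asked for. Some compilers are pretty bad at generating SSE code from intrinsics, so it pays to keep an eye on the code they produce.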

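Finally, you need to consult your CPU's manual/specs to make sure it actually executes the packed instructions you're using as fast as the scalar ones. (For modern CPUs I believe they do, but at least some older CPUs needed a bit of extra time for packed instructions. Not four times as long as the scalar ones, but enough that you couldn't reach a 4x speedup.)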

Answer 2

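My conclusion is that SSE2 isn't worth bothering with unless we compute on no fewer than 4 variables at once. (Maybe this only applies to rsqrt, but rsqrt is an expensive calculation that also includes several multiplications, so it probably applies to other calculations too.)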

Also, sqrt(x) is faster than x * rsqrt(x) with two iterations, and x * rsqrt(x) with one iteration is too inaccurate for distance calculation.

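So the statements I've seen on some boards that x * rsqrt(x) is faster than sqrt(x) are wrong. Unless you directly need 1/x^(1/2), using rsqrt instead of sqrt makes no sense and isn't worth the loss of precision.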

SQRT vs RSQRT vs SSE _mm_rsqrt_ps benchmark
