Question

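I'm testing the capabilities of the .NET C# System.Numerics.Vector class for packing and unpacking bits.

I was hoping for Vector bitwise shift left/right functionality, but that is not currently available, so I tried to simulate shifting with arithmetic and logical methods, as shown below. Here is what I saw: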

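Packing (a simulated bitwise SHIFT LEFT and OR) using Vector.Multiply() and Vector.BitwiseOr() is slightly worse* than array/pointer code.

*<10% degradation in throughput (MB/sec).

But unpacking (a simulated bitwise SHIFT RIGHT and AND) using Vector.Divide() and Vector.BitwiseAnd() is far worse** than array/pointer code.

**50% degradation in throughput.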
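The emulation described above can be sketched roughly as follows. This is a minimal illustration, not the code that was benchmarked: the class name `VectorShiftSketch`, its method names, and the 16-bit field width are assumptions chosen for the example. It uses `Vector<uint>` so that the divide behaves like a logical (unsigned) right shift.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, for illustration only.
static class VectorShiftSketch
{
    // Emulates (v << bits) per lane by multiplying with 2^bits.
    public static Vector<uint> ShiftLeft(Vector<uint> v, int bits)
        => Vector.Multiply(v, 1u << bits);

    // Emulates (v >> bits) per lane with an unsigned divide by 2^bits.
    public static Vector<uint> ShiftRight(Vector<uint> v, int bits)
        => Vector.Divide(v, new Vector<uint>(1u << bits));

    // Packs two 16-bit values into each 32-bit lane: (hi << 16) | lo.
    public static Vector<uint> Pack16(Vector<uint> hi, Vector<uint> lo)
        => Vector.BitwiseOr(ShiftLeft(hi, 16), lo);

    // Recovers the upper 16-bit field of each lane: (packed >> 16) & 0xFFFF.
    public static Vector<uint> UnpackHigh16(Vector<uint> packed)
        => Vector.BitwiseAnd(ShiftRight(packed, 16), new Vector<uint>(0xFFFFu));
}
```

The pack path needs only a multiply and an OR, while the unpack path goes through an integer divide, which is the operation the throughput figures above single out.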

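NB:

Vector was tested with uint elements (this was also raised in the comments).
The test basis was packing and unpacking 100 million up to 1 billion integers in blocks of 65536 integers. I randomly generated the int[] for each block.
I also tested bitwise (& | >> <<) as well as arithmetic (+ - * /) operations and saw no marked difference in cost. Even division was not that bad, with only a 10% degradation in throughput versus multiplication (the question of division was raised in the comments).
I changed my original test code (for the non-Vector comparison) to an unsafe/pointer routine to create more of a like-for-like test in terms of packing (many integers to a word) versus unpacking (a word to many integers). This brought the overall difference (between packing and unpacking) for the non-Vector code down to a variance of <5%. (This counters my comment about the compiler and optimization below.)
Non-optimized Vector: packing is 2x as fast as unpacking.
Optimized Vector: yielded a 4x improvement in packing (versus non-optimized Vector) and a 2x improvement in unpacking.
Non-optimized array/pointer: unpacking is ~5% faster than packing.
Optimized array/pointer: yielded a 3x improvement in packing (versus non-optimized array/pointer) and a 2.5x improvement in unpacking. Overall, optimized array/pointer packing was <5% faster than optimized array/pointer unpacking.
Optimized array/pointer packing was 10% faster than optimized Vector packing.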

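Conclusions so far:

Vector.Divide() appears to be a comparatively slow implementation versus normal arithmetic division.
Furthermore, the compiler does not appear to optimize Vector.Divide() code to anywhere near the same extent as Vector.Multiply() (which supports the comments below regarding the optimization of division).
Array/pointer processing is at present slightly faster than the Vector class for packing data, and significantly faster for unpacking.
System.Numerics needs Vector.ShiftLeft() and Vector.ShiftRight() methods.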
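For readers on a newer runtime: to the best of my knowledge, .NET 7 and later do expose per-lane shifts on the static `Vector` class, named `Vector.ShiftLeft` and `Vector.ShiftRightLogical` (plus `Vector.ShiftRightArithmetic` for signed types) rather than `ShiftRight`. They were not available on the framework version discussed here. A minimal sketch under that assumption, with the hypothetical class name `BuiltInShiftSketch` and a 16-bit field width picked only for illustration:

```csharp
using System;
using System.Numerics;

// Hypothetical helper; assumes a runtime (.NET 7 or later) that
// provides Vector.ShiftLeft and Vector.ShiftRightLogical.
static class BuiltInShiftSketch
{
    // (hi << 16) | lo in every 32-bit lane, without a multiply.
    public static Vector<uint> Pack16(Vector<uint> hi, Vector<uint> lo)
        => Vector.BitwiseOr(Vector.ShiftLeft(hi, 16), lo);

    // packed >> 16 in every lane, without a divide. No mask is needed
    // because a logical shift fills the upper bits with zeros.
    public static Vector<uint> UnpackHigh16(Vector<uint> packed)
        => Vector.ShiftRightLogical(packed, 16);
}
```

Whether this closes the unpacking gap reported above would have to be measured; the sketch only shows the shape of the calls.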

Question (updated):

Is my conclusion roughly on track, or are there other aspects to check or consider?

Further information:

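// Requires: using System; using System.Diagnostics;
// GetRandomIntegers, pack and unpack are the routines under test (defined elsewhere).
int numPages = 8192; // up to >15K
int testSize = 65536;
Stopwatch swPack = new Stopwatch();
Stopwatch swUnpack = new Stopwatch();
long byteCount = 0;
for (int p = 0; p < numPages; p++)
{
    // A fresh random block per page, values in the range 14600..14800.
    int[] data = GetRandomIntegers(testSize, 14600, 14800);

    swPack.Start();
    byte[] compressedBytes = pack(data);
    swPack.Stop();

    swUnpack.Start();
    int[] unpackedInts = unpack(compressedBytes);
    swUnpack.Stop();

    // Throughput is measured against the uncompressed size (4 bytes per int).
    byteCount += (data.Length * 4);
}
// bytes / 1000 / milliseconds is equivalent to MB/sec.
Console.WriteLine("Packing Throughput (MB/sec): " + byteCount / 1000 / swPack.ElapsedMilliseconds);
Console.WriteLine("Unpacking Throughput (MB/sec): " + byteCount / 1000 / swUnpack.ElapsedMilliseconds);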

Answer 1

IL

/// non-SIMD fallback implementation for 128-bit right-shift (unsigned)
/// n: number of bit positions to right-shift a 16-byte memory image.
/// Vector(T) argument 'v' is passed by-ref and modified in-situ.
/// Layout order of the two 64-bit quads is little-endian.

.method public static void SHR(Vector_T<uint64>& v, int32 n) aggressiveinlining
{
    ldarg v
    dup
    dup
    ldc.i4.8
    add
    ldind.i8
    ldc.i4.s 64
    ldarg n
    sub
    shl

    ldarg v
    ldind.i8
    ldarg n
    shr.un

    or
    stind.i8

    ldc.i4.8
    add
    dup
    ldind.i8
    ldarg n
    shr.un
    stind.i8

    ret
}

Pseudocode

As<Vector<ulong>,ulong>(ref v) = (As<Vector<ulong>,ulong>(in v) >> n) | 
                                  (ByteOffsAs<Vector<ulong>,ulong>(in v, 8) << (64 - n));
ByteOffsAs<Vector<ulong>,ulong>(ref v, 8) >>= n;
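For readers who want to check the logic without assembling IL, the same two-quad shift can be written against plain `ulong` values. This is only a sketch of the arithmetic, not the answer's implementation: the class name `Shift128Sketch` and the `lo`/`hi` parameters are hypothetical stand-ins for the two 64-bit halves of the vector.

```csharp
using System;

// Hypothetical scalar model of the 128-bit logical right shift.
static class Shift128Sketch
{
    // lo and hi are the low and high 64-bit quads (little-endian order).
    // Intended for 1 <= n <= 63: C# masks 64-bit shift counts to 6 bits,
    // so with n == 0 the term (hi << 64) would act like (hi << 0).
    public static void Shr(ref ulong lo, ref ulong hi, int n)
    {
        // Bits shifted out of the high quad enter the top of the low quad.
        lo = (lo >> n) | (hi << (64 - n));
        hi >>= n;
    }
}
```

For example, with `lo = 0`, `hi = 1` and `n = 1`, the single set bit moves from bit 64 to bit 63, leaving `lo = 0x8000000000000000` and `hi = 0`.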

C# extern declaration

static class vector_ext
{
    [MethodImpl(MethodImplOptions.ForwardRef | MethodImplOptions.AggressiveInlining)]
    extern public static void SHR(ref Vector<ulong> v, int n);
};

You can link the intermediate .netmodule binaries produced from the IL (ilasm.exe) and the C# (csc.exe) into a single assembly by using the /LTCG (link-time code generation) option of link.exe.

Runtime x64 JIT result (.NET Framework 4.7.2)

0x7FF878F5C7E0    48 89 4C 24 08       mov qword ptr [rsp+8],rcx
0x7FF878F5C7E5    8B C2                mov eax,edx
0x7FF878F5C7E7    F7 D8                neg eax
0x7FF878F5C7E9    8D 48 40             lea ecx,[rax+40h]
0x7FF878F5C7EC    48 8B 44 24 08       mov rax,qword ptr [rsp+8]
0x7FF878F5C7F1    4C 8B 40 08          mov r8,qword ptr [rax+8]
0x7FF878F5C7F5    49 D3 E0             shl r8,cl
0x7FF878F5C7F8    4C 8B 08             mov r9,qword ptr [rax]
0x7FF878F5C7FB    8B CA                mov ecx,edx
0x7FF878F5C7FD    49 D3 E9             shr r9,cl
0x7FF878F5C800    4D 0B C1             or  r8,r9
0x7FF878F5C803    4C 89 00             mov qword ptr [rax],r8
0x7FF878F5C806    48 83 C0 08          add rax,8
0x7FF878F5C80A    8B CA                mov ecx,edx
0x7FF878F5C80C    48 D3 28             shr qword ptr [rax],cl
0x7FF878F5C80F    C3                   ret
