Transpose a 4x4 matrix represented as a ulong value (as fast as possible)

Time: 2018-11-12 20:03:15

Tags: c# bit-manipulation 64-bit

I have been working on a C# implementation of 2048 so that I can apply reinforcement learning to it.

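The "slide" operation for each move requires tiles to be moved and combined according to specific rules. Doing so involves a number of transformations on a 2D array of values.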

Until recently I was using a 4x4 byte matrix:

var field = new byte[4,4];

Each value is an exponent of 2, so 0=0, 1=2, 2=4, 3=8, and so on. The 2048 tile is represented by 11.

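Because the (practical) maximum value for a given tile is 15 (which needs only 4 bits), the contents of this 4x4 byte array can be packed into a single ulong value.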
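To make the packing concrete, here is a minimal sketch of converting between the two representations. The names `BoardPacking`, `Pack`, and `Unpack` are hypothetical, and the digit order is an assumption: `squares[x, y]` is stored in hex digit `4*x + y`, counted from the most significant digit, which agrees with the masks used in the ulong code in this post.

```csharp
using System;

public static class BoardPacking
{
    const int SZ = 4;

    // Assumed layout: squares[x, y] lives in hex digit 4*x + y,
    // counted from the most significant digit of the ulong.
    public static ulong Pack(byte[,] squares)
    {
        ulong state = 0;
        for (int x = 0; x < SZ; x++)
            for (int y = 0; y < SZ; y++)
                state = (state << 4) | (ulong)(squares[x, y] & 0xF); // keep only the low 4 bits
        return state;
    }

    public static byte[,] Unpack(ulong state)
    {
        var squares = new byte[SZ, SZ];
        for (int x = 0; x < SZ; x++)
            for (int y = 0; y < SZ; y++)
            {
                int shift = 60 - 4 * (SZ * x + y); // bit offset of this tile's nibble
                squares[x, y] = (byte)((state >> shift) & 0xF);
            }
        return squares;
    }
}
```

Under this assumed layout, a board whose rows hold the exponents 1..4, 5..8, 9..12, and 13, 14, 15, 0 packs to 0x123456789ABCDEF0, so each 16-bit group reads as one row.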

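It turns out that certain operations are far more efficient with this representation. For example, I commonly have to invert the array like this: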

    //flip horizontally
    const byte SZ = 4;
    public static byte[,] Invert(this byte[,] squares)
    {
        var tmp = new byte[SZ, SZ];
        for (byte x = 0; x < SZ; x++)
            for (byte y = 0; y < SZ; y++)
                tmp[x, y] = squares[x, SZ - y - 1];
        return tmp;
    }

With the ulong representation I can make this inversion roughly 15x faster:

    public static ulong Invert(this ulong state)
    {
        ulong c1 = state & 0xF000F000F000F000L;
        ulong c2 = state & 0x0F000F000F000F00L;
        ulong c3 = state & 0x00F000F000F000F0L;
        ulong c4 = state & 0x000F000F000F000FL;

        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
    }

Note the use of hex, which is extremely useful here because each hex digit represents one tile.
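As a worked illustration of the one-digit-per-tile view, the sketch below repeats the masks and shifts of the ulong `Invert` and adds a hypothetical `Rows` helper (not part of the original code) that prints the state as four groups of four hex digits, one group per row.

```csharp
using System;

public static class InvertDemo
{
    // Same masks and shifts as the ulong Invert above: reverse the four
    // hex digits inside each 16-bit row.
    public static ulong Invert(ulong state)
    {
        ulong c1 = state & 0xF000F000F000F000UL;
        ulong c2 = state & 0x0F000F000F000F00UL;
        ulong c3 = state & 0x00F000F000F000F0UL;
        ulong c4 = state & 0x000F000F000F000FUL;

        return (c1 >> 12) | (c2 >> 4) | (c3 << 4) | (c4 << 12);
    }

    // Hypothetical helper: format the state as four rows of four hex digits.
    public static string Rows(ulong state)
    {
        string hex = state.ToString("X16");
        return string.Join(" ", hex.Substring(0, 4), hex.Substring(4, 4),
                                hex.Substring(8, 4), hex.Substring(12, 4));
    }
}
```

For the state 0x123456789ABCDEF0, `Rows` gives "1234 5678 9ABC DEF0" before the flip and "4321 8765 CBA9 0FED" after it, and applying `Invert` twice returns the original value.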

The operation I have had the most trouble with is Transpose, which swaps the x and y coordinates of the values in the 2D array, like this:

    public static byte[,] Transpose(this byte[,] squares)
    {
        var tmp = new byte[SZ, SZ];
        for (byte x = 0; x < SZ; x++)
            for (byte y = 0; y < SZ; y++)
                tmp[y, x] = squares[x, y];
        return tmp;
    }

The fastest way I have found to do this is the following bit of ridiculousness:

    public static ulong Transpose(this ulong state)
    {
        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

        result |= (state & 0x0F00000000000000L) >> 12;
        result |= (state & 0x00F0000000000000L) >> 24;
        result |= (state & 0x000F000000000000L) >> 36;
        result |= (state & 0x0000F00000000000L) << 12;
        result |= (state & 0x000000F000000000L) >> 12;
        result |= (state & 0x0000000F00000000L) >> 24;
        result |= (state & 0x00000000F0000000L) << 24;
        result |= (state & 0x000000000F000000L) << 12;
        result |= (state & 0x00000000000F0000L) >> 12;
        result |= (state & 0x000000000000F000L) << 36;
        result |= (state & 0x0000000000000F00L) << 24;
        result |= (state & 0x00000000000000F0L) << 12;

        return result;
    }

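Shockingly, this is still about 3x faster than the loop version. However, I am looking for a more performant method, either one that exploits a pattern inherent in transposition or one that manages the bits I am moving more efficiently.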

2 answers:

Answer 0: (score: 2)

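You can skip 6 steps by merging the masks that share the same shift. I have commented the merged steps out to show the result; this should make it about twice as fast: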

    public static ulong Transpose(this ulong state)
    {
        ulong result = state & 0xF0000F0000F0000FL; //unchanged diagonals

        result |= (state & 0x0F0000F0000F0000L) >> 12;
        result |= (state & 0x00F0000F00000000L) >> 24;
        result |= (state & 0x000F000000000000L) >> 36;
        result |= (state & 0x0000F0000F0000F0L) << 12;
        //result |= (state & 0x000000F000000000L) >> 12;
        //result |= (state & 0x0000000F00000000L) >> 24;
        result |= (state & 0x00000000F0000F00L) << 24;
        //result |= (state & 0x000000000F000000L) << 12;
        //result |= (state & 0x00000000000F0000L) >> 12;
        result |= (state & 0x000000000000F000L) << 36;
        //result |= (state & 0x0000000000000F00L) << 24;
        //result |= (state & 0x00000000000000F0L) << 12;

        return result;
    }
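The merge is valid because every tile on the same diagonal moves by the same distance. A tile at row r, column c sits at hex digit 4r + c (counted from the most significant digit) and moves to digit 4c + r, which is a right shift of 12 * (c - r) bits, or a left shift when c < r. The sketch below is one way to check that: it builds one mask per diagonal and compares the result with a digit-by-digit reference. `TransposeCheck`, `TransposeReference`, and `TransposeByDiagonal` are hypothetical names, and this is a verification aid rather than a fast implementation.

```csharp
using System;

public static class TransposeCheck
{
    // Reference: move each hex digit individually.
    // Digit index 4*r + c counts from the most significant digit.
    public static ulong TransposeReference(ulong state)
    {
        ulong result = 0;
        for (int r = 0; r < 4; r++)
            for (int c = 0; c < 4; c++)
            {
                ulong tile = (state >> (60 - 4 * (4 * r + c))) & 0xF;
                result |= tile << (60 - 4 * (4 * c + r));
            }
        return result;
    }

    // One mask per diagonal d = c - r, each shifted by 12 * d bits.
    // d = 1 yields 0x0F0000F0000F0000, the first merged mask above.
    public static ulong TransposeByDiagonal(ulong state)
    {
        ulong result = 0;
        for (int d = -3; d <= 3; d++)
        {
            ulong mask = 0;
            for (int r = 0; r < 4; r++)
            {
                int c = r + d;
                if (c >= 0 && c < 4)
                    mask |= 0xFUL << (60 - 4 * (4 * r + c));
            }
            ulong picked = state & mask;
            // Tiles above the main diagonal move right, tiles below it move left.
            result |= d >= 0 ? picked >> (12 * d) : picked << (-12 * d);
        }
        return result;
    }
}
```

For example, 0x0123456789ABCDEF (rows 0123, 4567, 89AB, CDEF) transposes to 0x048C159D26AE37BF with both methods.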

Answer 1: (score: 1)

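Another trick is that a multiplication can sometimes move disjoint sets of bit groups left by different amounts in one step. This requires that the partial products do not "overlap".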

For example, the moves left by 12 and by 24 can be done as:

ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
r0 |= t & 0x0FF000FF000F0000UL;

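That reduces 6 operations to 4. The multiplication should not be slow: on a modern processor it takes 3 cycles, and while the multiply is in flight the processor can go ahead and work on the other steps. As a bonus, on Intel the imul goes to port 1 while the shifts go to ports 0 and 6, so trading two shifts for one multiply is a good deal and leaves more room for the other shifts. The AND and OR operations can go to any ALU port and are not really the problem here, but splitting the chain of dependent ORs may help with latency: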

public static ulong Transpose(this ulong state)
{
    ulong r0 = state & 0xF0000F0000F0000FL; //unchanged diagonals

    ulong t = (state & 0x0000F000FF000FF0UL) * ((1UL << 12) + (1UL << 24));
    ulong r1 = (state & 0x0F0000F0000F0000L) >> 12;
    r0 |= (state & 0x00F0000F00000000L) >> 24;
    r1 |= (state & 0x000F000000000000L) >> 36;
    r0 |= (state & 0x000000000000F000L) << 36;
    r1 |= t & 0x0FF000FF000F0000UL;

    return r0 | r1;
}
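To see why the multiply is safe here, the following sketch compares it with the two explicit shifts and checks that the partial products share no bits. `MultiplyShiftDemo` and its members are hypothetical names. The `unchecked` block is an addition of this sketch: the top tile of the masked value is shifted out of the 64-bit range, which would throw if the project were compiled with overflow checking enabled.

```csharp
using System;

public static class MultiplyShiftDemo
{
    const ulong SourceMask = 0x0000F000FF000FF0UL; // tiles that move left by 12 or by 24 bits
    const ulong TargetMask = 0x0FF000FF000F0000UL; // the digits where those tiles must land

    // Two explicit shifts combined with OR.
    public static ulong WithShifts(ulong state)
    {
        ulong x = state & SourceMask;
        return ((x << 12) | (x << 24)) & TargetMask;
    }

    // One multiply: x * (2^12 + 2^24) equals (x << 12) + (x << 24) modulo 2^64.
    public static ulong WithMultiply(ulong state)
    {
        ulong x = state & SourceMask;
        unchecked // wrap-around is intended; bits shifted past bit 63 are discarded
        {
            return (x * ((1UL << 12) + (1UL << 24))) & TargetMask;
        }
    }

    // True when the two partial products have no set bits in common,
    // so the addition inside the multiply cannot carry and behaves like OR.
    public static bool PartialProductsDisjoint()
    {
        return ((SourceMask << 12) & (SourceMask << 24)) == 0;
    }
}
```

Each partial product also contains copies shifted by the "wrong" amount, which is why the result is masked with `TargetMask` afterwards. For 0x0123456789ABCDEF both methods return 0x0480009D000E0000.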