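Optimizing the rearrangement of bits

Date: 2015-11-18 21:30:22

Tags: c# optimization bit-manipulation

I have a core C# function that I am trying to speed up. Suggestions involving either safe or unsafe code are equally welcome. Here is the method: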

public byte[] Interleave(uint[] vector)
{
    var byteVector = new byte[BytesNeeded + 1]; // Extra byte needed when creating a BigInteger, for sign bit.
    foreach (var idx in PrecomputedIndices)
    {
        var bit = (byte)(((vector[idx.iFromUintVector] >> idx.iFromUintBit) & 1U) << idx.iToByteBit);
        byteVector[idx.iToByteVector] |= bit;
    }
    return byteVector;
}

PrecomputedIndices is an array of the following class:

class Indices
{
    public readonly int iFromUintVector;
    public readonly int iFromUintBit;
    public readonly int iToByteVector;
    public readonly int iToByteBit;

    public Indices(int fromUintVector, int fromUintBit, int toByteVector, int toByteBit)
    {
        iFromUintVector = fromUintVector;
        iFromUintBit = fromUintBit;
        iToByteVector = toByteVector;
        iToByteBit = toByteBit;
    }
}

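The purpose of the Interleave method is to copy bits from an array of uints into an array of bytes. I have precomputed the source and target array indices and the source and target bit positions, and stored them in the Indices objects. No two adjacent bits in the source end up adjacent in the target, which rules out certain optimizations.

To give you an idea of scale, the problem I am working on has about 4,200 dimensions, so "vector" has 4,200 elements. The values in the vector range from 0 to 12, so I only need 4 bits to store each value in the byte array. That means 4,200 x 4 = 16,800 bits of data, or 2,100 bytes of output per vector. This method will be called millions of times, and it consumes about a third of the running time of the larger program I need to optimize.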
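The question does not show how PrecomputedIndices is filled in. As a rough sketch only, assuming a simple round-robin interleave (target bit k takes bit k / dimensions of element k % dimensions), which may not be the ordering actually used, the table and the 16,800-bit total could be produced like this. BitMove and InterleaveLayout are hypothetical names, not part of the original code.

```csharp
using System;

// Hypothetical stand-in for the Indices class in the question.
public sealed class BitMove
{
    public readonly int FromUintVector;
    public readonly int FromUintBit;
    public readonly int ToByteVector;
    public readonly int ToByteBit;

    public BitMove(int fromUintVector, int fromUintBit, int toByteVector, int toByteBit)
    {
        FromUintVector = fromUintVector;
        FromUintBit = fromUintBit;
        ToByteVector = toByteVector;
        ToByteBit = toByteBit;
    }
}

public static class InterleaveLayout
{
    // ASSUMPTION: round-robin interleave. Target bit k is read from
    // bit (k / dimensions) of element (k % dimensions), so two adjacent
    // bits of one source element land `dimensions` bits apart in the target.
    public static BitMove[] Build(int dimensions, int bitsPerDimension)
    {
        int totalBits = dimensions * bitsPerDimension; // 4,200 x 4 = 16,800
        var moves = new BitMove[totalBits];
        for (int k = 0; k < totalBits; k++)
        {
            moves[k] = new BitMove(
                k % dimensions,  // which source uint
                k / dimensions,  // which bit of that uint
                k >> 3,          // which target byte
                k & 7);          // which bit of that byte
        }
        return moves;
    }
}
```

With 4,200 dimensions and 4 bits each, the table has 16,800 entries and the last entry targets byte 2,099, which matches the 2,100 bytes of output mentioned above.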

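Update 1: Changing "Indices" to a struct and shrinking some of the data types, so that the object is only 8 bytes (an int, a short, and 2 bytes), reduced the method's share of execution time from 35% to 30%.

1 answer:

Answer 0 (score: 0)

These are the key parts of my revised implementation, incorporating ideas from the commenters:

1) Convert the object to a struct, shrink the data types to smaller integer types, and rearrange the fields so that the object fits in a single 64-bit value, which works better on 64-bit machines: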

struct Indices
{
    /// <summary>
    /// Index into source vector of source uint to read.
    /// </summary>
    public readonly int iFromUintVector;

    /// <summary>
    /// Index into target vector of target byte to write.
    /// </summary>
    public readonly short iToByteVector;

    /// <summary>
    /// Index into source uint of source bit to read.
    /// </summary>
    public readonly byte iFromUintBit;

    /// <summary>
    /// Index into target byte of target bit to write.
    /// </summary>
    public readonly byte iToByteBit;

    public Indices(int fromUintVector, byte fromUintBit, short toByteVector, byte toByteBit)
    {
        iFromUintVector = fromUintVector;
        iFromUintBit = fromUintBit;
        iToByteVector = toByteVector;
        iToByteBit = toByteBit;
    }
}
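One way to confirm the 8-byte claim is to ask the runtime for the struct's size. The sketch below is an illustration under assumptions: PackedIndices copies the field layout above, WideIndices is a hypothetical four-int version for comparison, and Marshal.SizeOf reports the marshaled size, which for structs made only of primitive integer fields should match the in-memory size.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical copy of the packed layout: int + short + byte + byte.
[StructLayout(LayoutKind.Sequential)]
public struct PackedIndices
{
    public readonly int iFromUintVector;
    public readonly short iToByteVector;
    public readonly byte iFromUintBit;
    public readonly byte iToByteBit;

    public PackedIndices(int fromUintVector, byte fromUintBit, short toByteVector, byte toByteBit)
    {
        iFromUintVector = fromUintVector;
        iFromUintBit = fromUintBit;
        iToByteVector = toByteVector;
        iToByteBit = toByteBit;
    }
}

// Hypothetical comparison layout: the same four values stored as ints.
[StructLayout(LayoutKind.Sequential)]
public struct WideIndices
{
    public int iFromUintVector;
    public int iFromUintBit;
    public int iToByteVector;
    public int iToByteBit;
}

public static class LayoutCheck
{
    // Size in bytes of one packed entry.
    public static int SizeOfPacked()
    {
        return Marshal.SizeOf<PackedIndices>();
    }

    // Size in bytes of one four-int entry.
    public static int SizeOfWide()
    {
        return Marshal.SizeOf<WideIndices>();
    }
}
```

Halving the entry size means twice as many entries fit in each cache line. Note that a short target index limits the output to 32,767 bytes, which is comfortably above the 2,100 bytes needed here.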

2) Sort PrecomputedIndices so that target bytes and bits are written in ascending order, which improves memory cache access:

    Comparison<Indices> sortByTargetByteAndBit = (a, b) =>
    {
        if (a.iToByteVector < b.iToByteVector) return -1;
        if (a.iToByteVector > b.iToByteVector) return 1;
        if (a.iToByteBit < b.iToByteBit) return -1;
        if (a.iToByteBit > b.iToByteBit) return 1;
        return 0;
    };
    Array.Sort(PrecomputedIndices, sortByTargetByteAndBit);
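Processing the sorted array eight entries at a time only works if entry k really does target bit k % 8 of byte k / 8, with no gaps or duplicates. A minimal sketch of a one-time sanity check follows; TargetSlot and SortedLayoutCheck are hypothetical names, and the struct keeps only the two target fields to stay short.

```csharp
using System;

// Hypothetical reduced version of Indices holding only the target fields.
public struct TargetSlot
{
    public readonly short iToByteVector;
    public readonly byte iToByteBit;

    public TargetSlot(short toByteVector, byte toByteBit)
    {
        iToByteVector = toByteVector;
        iToByteBit = toByteBit;
    }
}

public static class SortedLayoutCheck
{
    // Sort by target byte, then by target bit, as in the comparison above.
    public static void SortByTarget(TargetSlot[] slots)
    {
        Array.Sort(slots, (a, b) =>
        {
            int byByte = a.iToByteVector.CompareTo(b.iToByteVector);
            return byByte != 0 ? byByte : a.iToByteBit.CompareTo(b.iToByteBit);
        });
    }

    // True when entry k writes bit (k % 8) of byte (k / 8) for every k,
    // i.e. every target bit is covered exactly once, in order.
    public static bool IsDense(TargetSlot[] sorted)
    {
        for (int k = 0; k < sorted.Length; k++)
        {
            if (sorted[k].iToByteVector != (k >> 3) || sorted[k].iToByteBit != (k & 7))
            {
                return false;
            }
        }
        return true;
    }
}
```

Running IsDense once after the sort, at startup, costs nothing in the hot path and would catch a precomputation bug before it silently produces wrong bytes.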

3) Unroll the loop so that a whole target byte is assembled at once, reducing the number of times I access the target array:

public byte[] Interleave(uint[] vector)
{
    var byteVector = new byte[BytesNeeded + 1]; // An extra byte is needed to hold the extra bits and a sign bit for the BigInteger.
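    var extraBits = Bits - (BytesNeeded << 3); // Parenthesized: << binds more loosely than -, so the original computed (Bits - BytesNeeded) << 3.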
    int iIndex = 0;
    var iByte = 0;
    for (; iByte < BytesNeeded; iByte++)
    {
        // Unroll the loop so we compute the bits for a whole byte at a time.
        uint bits = 0;
        var idx0 = PrecomputedIndices[iIndex];
        var idx1 = PrecomputedIndices[iIndex + 1];
        var idx2 = PrecomputedIndices[iIndex + 2];
        var idx3 = PrecomputedIndices[iIndex + 3];
        var idx4 = PrecomputedIndices[iIndex + 4];
        var idx5 = PrecomputedIndices[iIndex + 5];
        var idx6 = PrecomputedIndices[iIndex + 6];
        var idx7 = PrecomputedIndices[iIndex + 7];
        bits = (((vector[idx0.iFromUintVector] >> idx0.iFromUintBit) & 1U))
             | (((vector[idx1.iFromUintVector] >> idx1.iFromUintBit) & 1U) << 1)
             | (((vector[idx2.iFromUintVector] >> idx2.iFromUintBit) & 1U) << 2)
             | (((vector[idx3.iFromUintVector] >> idx3.iFromUintBit) & 1U) << 3)
             | (((vector[idx4.iFromUintVector] >> idx4.iFromUintBit) & 1U) << 4)
             | (((vector[idx5.iFromUintVector] >> idx5.iFromUintBit) & 1U) << 5)
             | (((vector[idx6.iFromUintVector] >> idx6.iFromUintBit) & 1U) << 6)
             | (((vector[idx7.iFromUintVector] >> idx7.iFromUintBit) & 1U) << 7);
        byteVector[iByte] = (Byte)bits;
        iIndex += 8;
    }
    for (; iIndex < PrecomputedIndices.Length; iIndex++)
    {
        var idx = PrecomputedIndices[iIndex];
        var bit = (byte)(((vector[idx.iFromUintVector] >> idx.iFromUintBit) & 1U) << idx.iToByteBit);
        byteVector[idx.iToByteVector] |= bit;
    }
    return byteVector;
}
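Since the unrolled version depends on the sort order, it may be worth checking it against the original bit-at-a-time loop. The following is a sketch under assumptions, not the author's test code: Move mirrors the Indices struct, BuildMoves uses a hypothetical round-robin interleave that is already in target order, and the byte-at-a-time loop is written with an inner loop of 8 rather than unrolled by hand.

```csharp
using System;

// Hypothetical mirror of the Indices struct.
public struct Move
{
    public readonly int iFromUintVector;
    public readonly short iToByteVector;
    public readonly byte iFromUintBit;
    public readonly byte iToByteBit;

    public Move(int fromUintVector, byte fromUintBit, short toByteVector, byte toByteBit)
    {
        iFromUintVector = fromUintVector;
        iFromUintBit = fromUintBit;
        iToByteVector = toByteVector;
        iToByteBit = toByteBit;
    }
}

public static class InterleaveCheck
{
    // ASSUMPTION: round-robin layout, already sorted by target byte and bit.
    public static Move[] BuildMoves(int dimensions, int bitsPerValue)
    {
        var moves = new Move[dimensions * bitsPerValue];
        for (int k = 0; k < moves.Length; k++)
        {
            moves[k] = new Move(k % dimensions, (byte)(k / dimensions), (short)(k >> 3), (byte)(k & 7));
        }
        return moves;
    }

    // Reference version: one bit per iteration, as in the question.
    public static byte[] Simple(uint[] vector, Move[] moves, int bytesNeeded)
    {
        var result = new byte[bytesNeeded + 1];
        foreach (var m in moves)
        {
            result[m.iToByteVector] |= (byte)(((vector[m.iFromUintVector] >> m.iFromUintBit) & 1U) << m.iToByteBit);
        }
        return result;
    }

    // Byte-at-a-time version: relies on entries 8b..8b+7 targeting byte b.
    public static byte[] ByteAtATime(uint[] vector, Move[] sorted, int bytesNeeded)
    {
        var result = new byte[bytesNeeded + 1];
        int i = 0;
        for (int b = 0; b < bytesNeeded; b++)
        {
            uint bits = 0;
            for (int j = 0; j < 8; j++, i++)
            {
                bits |= ((vector[sorted[i].iFromUintVector] >> sorted[i].iFromUintBit) & 1U) << j;
            }
            result[b] = (byte)bits;
        }
        for (; i < sorted.Length; i++) // leftover bits go into the extra byte
        {
            result[sorted[i].iToByteVector] |= (byte)(((vector[sorted[i].iFromUintVector] >> sorted[i].iFromUintBit) & 1U) << sorted[i].iToByteBit);
        }
        return result;
    }

    // Compares both versions on a random vector with values in 0..12.
    public static bool Agree(int dimensions, int bitsPerValue, int seed)
    {
        var rng = new Random(seed);
        var vector = new uint[dimensions];
        for (int d = 0; d < dimensions; d++) vector[d] = (uint)rng.Next(0, 13);
        var moves = BuildMoves(dimensions, bitsPerValue);
        int bytesNeeded = (dimensions * bitsPerValue) >> 3;
        var a = Simple(vector, moves, bytesNeeded);
        var b = ByteAtATime(vector, moves, bytesNeeded);
        for (int k = 0; k < a.Length; k++) if (a[k] != b[k]) return false;
        return true;
    }
}
```

Calling Agree with a dimension count whose bit total is not a multiple of 8 (for example 5 dimensions at 4 bits each) also exercises the tail loop that handles the leftover bits.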

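#1 cut the function from 35% of execution time to 30% (a 14% saving).

#2 did not speed up the function, but made #3 possible.

#3 cut the function from 30% of execution time to 19.6%, another 33% reduction.

Total savings: 44%!!!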