Question

是否有一种简单的方法可以在Delphi中对字节变量进行位反映，以便最高有效位（MSB）获得最低有效位（LSB），反之亦然？

Answer 1

在代码中你可以这样做：

function ReverseBits(b: Byte): Byte;
var 
  i: Integer;
begin
  Result := 0;
  for i := 1 to 8 do
  begin
    Result := (Result shl 1) or (b and 1);
    b := b shr 1;
  end;
end;

但是查找表会更有效率，并且只消耗256字节的内存。

function ReverseBits(b: Byte): Byte; inline;
const
  Table: array [Byte] of Byte = (
    0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240,
    8,136,72,200,40,168,104,232,24,152,88,216,56,184,120,248,
    4,132,68,196,36,164,100,228,20,148,84,212,52,180,116,244,
    12,140,76,204,44,172,108,236,28,156,92,220,60,188,124,252,
    2,130,66,194,34,162,98,226,18,146,82,210,50,178,114,242,
    10,138,74,202,42,170,106,234,26,154,90,218,58,186,122,250,
    6,134,70,198,38,166,102,230,22,150,86,214,54,182,118,246,
    14,142,78,206,46,174,110,238,30,158,94,222,62,190,126,254,
    1,129,65,193,33,161,97,225,17,145,81,209,49,177,113,241,
    9,137,73,201,41,169,105,233,25,153,89,217,57,185,121,249,
    5,133,69,197,37,165,101,229,21,149,85,213,53,181,117,245,
    13,141,77,205,45,173,109,237,29,157,93,221,61,189,125,253,
    3,131,67,195,35,163,99,227,19,147,83,211,51,179,115,243,
    11,139,75,203,43,171,107,235,27,155,91,219,59,187,123,251,
    7,135,71,199,39,167,103,231,23,151,87,215,55,183,119,247,
    15,143,79,207,47,175,111,239,31,159,95,223,63,191,127,255
  );
begin
  Result := Table[b];
end;

这比在单个位上运行的代码版本快10倍以上。

最后，当我有一个相互竞争的答案时，我通常不会对已接受的答案过于负面评论。在这种情况下，您接受的答案存在非常严重的问题，我希望为您和未来的读者清楚地说明。

你接受了@Arioch的答案，当时它包含了与本答案中相同的Pascal代码，以及两个汇编程序版本。事实证明，那些汇编程序版本比Pascal版本慢得多。它们的速度是Pascal代码的两倍。

将高级代码转换为汇编程序会导致代码更快，这是一种常见的谬误。如果执行得很糟糕，那么您可以轻松生成运行速度比编译器发出的代码慢的代码。有时候在汇编程序中编写代码是值得的，但如果没有适当的基准测试，你就不可能这样做。

在这里使用汇编程序特别令人震惊的是，基于表格的解决方案非常快，显而易见。很难想象如何才能显着改善这一点。

Answer 2

function ByteReverseLoop(b: byte): byte;
var i: integer;
begin
  Result := 0; // actually not needed, just to make compiler happy

  for i := 1 to 8 do
  begin
    Result := Result shl 1; 
    if Odd(b) then Result := Result or 1;
    b := b shr 1;
  end;
end;

如果速度很重要，那么您可以使用查找表。您在程序启动时感觉到它，然后您只需从表中获取值。由于您只需要将字节映射到字节，因此需要256x1 = 256字节的内存。鉴于最近的Delphi版本支持内联函数，这将提供速度，可读性和可靠性（在函数中包含数组查找，您可能确定不会因为某些错误而更改值）

Var ByteReverseLUT: array[byte] of byte;

function ByteReverse(b: byte): byte; inline;
begin Result := ByteReverseLUT[b] end;

{Unit/program initialization}
var b: byte;
    for b := Low(ByteReverseLUT) to High(ByteReverseLUT) 
        do ByteReverseLUT[b] := ByteReverseLoop(b);

对此论坛中提到的几种实施方式进行速度比较 AMD Phenom2 x710 / Win7 x64 / Delphi XE2 32位{$ O +}

Pascal AND original:       12494
Pascal AND reversed:       33459
Pascal IF original:        46829
Pascal IF reversed:        45585

       Asm SHIFT 1:        15802
       Asm SHIFT 2:        15490
       Asm SHIFT 3:        16212

         Asm AND 1:        19408
         Asm AND 2:        19601
         Asm AND 3:        19802

Pascal AND展开：10052 Asm Shift展开：4573 LUT，名为：3192 Pascal数学，名为：4614

http://pastebin.ca/2304708

注意：LUT（查找表）时序在这里可能相当乐观。由于在紧密循环中运行，整个表被吸入L1 CPU缓存。在实际计算中，这个函数很可能被频繁调用，并且L1缓存不会完全保留表格。

Pascal内联函数调用结果是假的 - Delphi没有调用它们，检测到它们没有副作用。但很有趣 - 时间不同。

      Asm Shift unrolled:         4040
             LUT, called:         3011
            LUT, inlined:          977
         Pascal unrolled:        10052
  Pas. unrolled, inlined:          849
     Pascal math, called:         4614
    Pascal math, inlined:         6517

在解释之下：

Project1.dpr.427: d := BitFlipLUT(i)
0044AC45 8BC3             mov eax,ebx
0044AC47 E89CCAFFFF       call BitFlipLUT

Project1.dpr.435: d := BitFlipLUTi(i)

Project1.dpr.444: d := MirrorByte(i);
0044ACF8 8BC3             mov eax,ebx
0044ACFA E881C8FFFF       call MirrorByte

Project1.dpr.453: d := MirrorByteI(i);
0044AD55 8BC3             mov eax,ebx

Project1.dpr.460: d := MirrorByte7Op(i);
0044ADA3 8BC3             mov eax,ebx
0044ADA5 E8AEC7FFFF       call MirrorByte7Op

Project1.dpr.462: d := MirrorByte7OpI(i);
0044ADF1 0FB6C3           movzx eax,bl

消除了对内联函数的所有调用。然而，关于传递参数，Delphi做出了三个不同的决定：

对于第一次调用，它消除了与函数调用一起传递的参数
对于第二次调用，它保持参数传递，尽管没有调用函数
对于第3次调用，它保持更改参数传递，证明比函数调用本身更长！奇怪的！： - ）

Answer 3

function BitFlip(B: Byte): Byte;
const
  N: array[0..15] of Byte = (0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
begin
  Result := N[B div 16] or N[B mod 16] shl 4;
end;

Answer 4

使用蛮力可以简单有效。

此例程不是 on 与David的LUT解决方案相同。

<强>更新

添加了字节数组作为输入，并将结果分配给字节数组。这表明LUT解决方案具有更好的性能。

function MirrorByte(b : Byte) : Byte;  inline;
begin
  Result :=
    ((b and $01) shl 7) or
    ((b and $02) shl 5) or
    ((b and $04) shl 3) or
    ((b and $08) shl 1) or
    ((b and $10) shr 1) or
    ((b and $20) shr 3) or
    ((b and $40) shr 5) or
    ((b and $80) shr 7);
end;

更新2

谷歌搜索了一下，发现BitReverseObvious。

function MirrorByte7Op(b : Byte) : Byte;  inline;
begin
  Result :=
    {$IFDEF WIN64}  // This is slightly better in x64 than the code in x32
    (((b * UInt64($80200802)) and UInt64($0884422110)) * UInt64($0101010101)) shr 32;
    {$ENDIF}
    {$IFDEF WIN32}
    ((b * $0802 and $22110) or (b * $8020 and $88440)) * $10101 shr 16;
    {$ENDIF}
end;

这个更接近LUT解决方案，在一次测试中更快。

总而言之， MirrorByte7Op（）在3次测试中比LUT慢5-30％，在一次测试中快5％。

基准代码：

uses
  System.Diagnostics;

const 
  cBit : Byte = $AA;
  cLoopMax = 1000000000;
var
  sw : TStopWatch;
  arrB : array of byte;
  i : Integer;

begin
  SetLength(arrB,cLoopMax);
  for i := 0 TO Length(arrB) - 1 do
    arrB[i]:= System.Random(256);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    b := b;
  end;
  sw.Stop;
  WriteLn('Loop             ',b:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    b := ReflectBits(arrB[i]); 
  end;
  sw.Stop;
  WriteLn('RB array in:     ',b:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    b := MirrorByte(arrB[i]);
  end;
  sw.Stop;
  WriteLn('MB array in:     ',b:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    b := MirrorByte7Op(arrB[i]);
  end;
  sw.Stop;
  WriteLn('MB7Op array in : ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    arrB[i] := ReflectBits(arrB[i]);
  end;
  sw.Stop;
  WriteLn('RB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    arrB[i]:= MirrorByte(arrB[i]);
  end;
  sw.Stop;
  WriteLn('MB array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);

  sw := TStopWatch.StartNew;
  for i := 0 to Pred(cLoopMax) do
  begin
    arrB[i]:= MirrorByte7Op(arrB[i]);
  end;
  sw.Stop;
  WriteLn('MB7Op array in/out: ',arrB[0]:3,' ',sw.ElapsedMilliSeconds);

  ReadLn;

end.

基准测试结果（XE3，i7 CPU 870）：

                                32 bit     64 bit
--------------------------------------------------
Byte assignment (= empty loop)   599 ms    2117 ms
MirrorByte    to byte, array in 6991 ms    8746 ms
MirrorByte7Op to byte, array in 1384 ms    2510 ms
ReverseBits   to byte, array in  945 ms    2119 ms
--------------------------------------------------
ReverseBits   array in/out      1944 ms    3721 ms
MirrorByte7Op array in/out      1790 ms    3856 ms
BitFlipNibble array in/out      1995 ms    6730 ms
MirrorByte    array in/out      7157 ms    8894 ms
ByteReverse   array in/out     38246 ms   42303 ms

我在表的最后部分添加了一些其他提议（全部内联）。在带有数组和数组的循环中进行测试可能是最公平的。 ReverseBits（LUT）和MirrorByte7Op在速度上具有可比性，后跟BitFlipNibble（LUT），其在x64中的表现不佳。

注意：我为MirrorByte7Op的x64位部分添加了一个新算法。它可以更好地利用64位寄存器并减少指令。

我如何在Delphi中对一个字节进行位反映？

4 个答案: