Question

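While writing an integer-to-hex-string function, I noticed I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold).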

char *i2s(int n){
    static char buf[(sizeof(int)<<1)+1]={0};
    int i=0;
    while(i<(sizeof(int)<<1)+1){    /* mask the ith hex, shift it to lsb */
//      buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
        buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
        if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
    }
    for(i=0;buf[i++]=='0';)
        /*find first non-zero*/;
    return (char *)buf+i;
}

With the extra shift and mask, compiling with gcc -S -O3, the loops unroll and reduce to:

    movb    $48, buf.1247
    xorl    %eax, %eax
    movb    $48, buf.1247+1
    movb    $48, buf.1247+2
    movb    $48, buf.1247+3
    movb    $48, buf.1247+4
    movb    $48, buf.1247+5
    movb    $48, buf.1247+6
    movb    $48, buf.1247+7
    movb    $48, buf.1247+8
    .p2align 4,,7
    .p2align 3
.L26:
    movzbl  buf.1247(%eax), %edx
    addl    $1, %eax
    cmpb    $48, %dl
    je  .L26
    addl    $buf.1247, %eax
    ret

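That is what I expected for 32-bit x86 (it should be similar for 64-bit, but with twice as many movb-like ops). However, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.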

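Any ideas why this would be happening? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so when the sign bit is set, >> pads the MSB with 1s rather than 0s.)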

Answer 1

I think it has to do with the fact that in the longer version you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimize based on that fact.

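Regarding the right shift, C++ can do the exact equivalent of Java's two operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" depends on whether x is signed or unsigned. If x is signed, a shift-right-arithmetic (SRA, sign-extending) is used; if x is unsigned, a shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers (for negative numbers the result rounds toward negative infinity).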

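Unrolling loops is no longer necessarily a good idea because: 1) newer processors come with a micro-op buffer, which often speeds up small loops, and 2) code bloat makes instruction caching less efficient by taking up more room in L1i. Micro-benchmarks will hide that effect.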

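The algorithm doesn't have to be that complicated. Also, your algorithm has a problem: it returns '0' for multiples of 16, and for 0 itself it returns an empty string.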

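Below is a rewrite of the algo which is branch-free except for the loop exit check (or completely branch-free if the compiler decides to unroll it). It is faster, generates shorter code, and fixes the multiple-of-16 bug.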

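Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algo: they only take 1 clock cycle each, and the CPU is able to execute 3 or 4 of them in the same clock cycle.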

const char* i2s_brcfree(int n)
{
  static char buf[ sizeof(n)*2+1] = {0};
  unsigned int nibble_shifter = n;
  for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
    const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
    char digit = '0' + curr_nibble;
    // "promote" to hex if nibble is over 9, 
    // conditionally adding the difference between ('0'+nibble) and 'A' 
    enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
    digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
    *p = digit;
  }
  return buf;
}
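The conditional add in the loop above can be looked at in isolation. The following is a minimal sketch with a hypothetical helper name, hex_digit_branchless; it assumes an ASCII-compatible character set and a two's complement int, so that -1 has all bits set.

```c
/* Convert a value in 0..15 to its uppercase hex digit without branching.
   hex_digit_branchless is a hypothetical name used for illustration. */
char hex_digit_branchless(unsigned int nibble)
{
    /* (nibble > 9) evaluates to 0 or 1. Negating it gives 0 or -1, and -1
       is all one bits on two's complement machines, so the AND below keeps
       either nothing or the whole offset. */
    int mask = -(int)(nibble > 9);
    int offset = 'A' - '0' - 10;   /* distance from ('0' + 10) up to 'A' */
    return (char)('0' + (int)nibble + (offset & mask));
}
```

For nibble values 0 through 9 the mask is 0 and the digit is '0' + nibble; for 10 through 15 the mask is all ones and the offset moves the result into 'A' through 'F'.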

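EDIT: C++ does not specify the result of right-shifting a negative number (it is implementation-defined). I only know that GCC and Visual Studio do it this way on the x86 architecture.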

Answer 2

It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.

gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but it can't see that for the shorter expression.

clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.

Obviously this means your function is buggy! And that's not the only bug.

Off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and the function could just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
Off-by-one in the search loop, so you always return buf+1 or later, because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
Undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1] would avoid that, but actually you need to round up when dividing by four.)
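As a rough illustration of how the points listed above could be addressed, here is a minimal sketch. It is not the version linked from this answer; the name i2s_fixed and the HEX_DIGITS macro are made up for this example. It sizes the buffer from CHAR_BIT with rounding up, writes the terminator explicitly, works on an unsigned copy, and keeps at least one digit when stripping leading zeros.

```c
#include <limits.h>
#include <stddef.h>

/* Hex digits needed for an unsigned int, rounding up (hypothetical macro). */
#define HEX_DIGITS ((CHAR_BIT * sizeof(unsigned int) + 3) / 4)

/* Hypothetical name. Returns a pointer into a static buffer, so the
   result is overwritten by the next call and is not thread-safe. */
const char *i2s_fixed(int n)
{
    static char buf[HEX_DIGITS + 1];
    unsigned int u = (unsigned int)n;   /* unsigned shifts are well-defined */
    size_t i;

    /* Fill from the least significant digit at the end of the buffer.
       i is only modified in the loop condition, never twice per statement. */
    for (i = HEX_DIGITS; i-- > 0; u >>= 4) {
        unsigned int nibble = u & 0xFu;
        buf[i] = (char)(nibble < 10 ? '0' + nibble : 'A' + (nibble - 10));
    }
    buf[HEX_DIGITS] = '\0';             /* explicit terminator */

    /* Skip leading zeros but always keep the last digit, so 0 gives "0". */
    for (i = 0; i < HEX_DIGITS - 1 && buf[i] == '0'; i++)
        ;
    return buf + i;
}
```

Calling i2s_fixed(255) yields "FF", and i2s_fixed(0) yields "0" rather than an empty string.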

clang and gcc both warn about - having higher precedence than << (so (sizeof(int)<<3)-i<<2 parses as ((sizeof(int)<<3)-i)<<2), but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i]='0'+ (0x0f & (n >> (4*i))); That compiles to pretty clunky code, though. gcc probably does better with @Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-count shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
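The two ways of walking the nibbles mentioned above can be put side by side. This is only a sketch; the function names and the NIBBLES macro are hypothetical. Both functions store the nibbles lowest first and should produce identical output; they differ only in whether the shift count varies per iteration.

```c
#include <limits.h>
#include <stddef.h>

/* Number of 4-bit groups in an unsigned int, rounding up (hypothetical). */
#define NIBBLES ((CHAR_BIT * sizeof(unsigned int) + 3) / 4)

/* Variable-count shift: nibble i is (n >> 4*i) & 0xF, with i = 0 the lowest.
   out must have room for NIBBLES elements. */
void nibbles_indexed(unsigned int n, unsigned char out[])
{
    for (size_t i = 0; i < NIBBLES; i++)
        out[i] = (unsigned char)((n >> (4 * i)) & 0xFu);
}

/* Repeated shift by a constant 4, which can map to a shift-by-immediate
   instruction even when the loop is not unrolled. */
void nibbles_repeated(unsigned int n, unsigned char out[])
{
    unsigned int tmp = n;
    for (size_t i = 0; i < NIBBLES; i++, tmp >>= 4)
        out[i] = (unsigned char)(tmp & 0xFu);
}
```

Both versions operate on an unsigned value so that every right shift is zero-filling and well-defined.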

In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.

It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).

I put a bugfixed version on godbolt, along with your original.

Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)

gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!

i2s_orig:
    mov     BYTE PTR buf.1406+3, 48
    mov     BYTE PTR buf.1406, 48
    cmp     BYTE PTR buf.1406+3, 48
    mov     BYTE PTR buf.1406+1, 48
    mov     BYTE PTR buf.1406+2, 48
    mov     BYTE PTR buf.1406+4, 48
    mov     BYTE PTR buf.1406+5, 48
    mov     BYTE PTR buf.1406+6, 48
    mov     BYTE PTR buf.1406+7, 48
    mov     BYTE PTR buf.1406+8, 48
    mov     BYTE PTR buf.1406+9, 0
    jne     .L7    # testing flags from the compare earlier
    jne     .L8
    jne     .L9
    jne     .L10
    jne     .L11
    sete    al
    movzx   eax, al
    add     eax, 8
.L3:
    add     eax, OFFSET FLAT:buf.1406
    ret
.L7:
    mov     eax, 3
    jmp     .L3
 ... more of the same, setting eax to 4, or 5, etc.

Putting multiple jne instructions in a row is obviously useless.

Why is adding a superfluous mask and bitshift more optimizable?
