Question

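I have an unsigned 32-bit integer that is encoded as follows:

The first 6 bits define the opcode
The next 8 bits define the register
The next 18 bits are a two's complement signed integer value.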
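
To make the layout concrete, here is a minimal sketch of the inverse operation. It assumes "first" means most significant, which matches the shifts used in the decoding code; the helper name encode_inst is hypothetical and does not appear in the question.

```c
#include <stdint.h>

/* Hypothetical helper (not from the question): packs the three fields
   into one 32-bit word, assuming the opcode occupies the top bits. */
static uint32_t encode_inst(uint32_t opcode, uint32_t r1, int32_t value)
{
    return ((opcode & 0x3Fu) << 26)         /* bits 31..26: opcode   */
         | ((r1 & 0xFFu) << 18)             /* bits 25..18: register */
         | ((uint32_t)value & 0x3FFFFu);    /* bits 17..0 : value, two's complement */
}
```

Under that assumption, encode_inst(1, 2, -2) yields 0x040BFFFE: the low 18 bits 0x3FFFE are the two's complement pattern for -2.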

I am currently decoding this number (uint32_t inst) with:

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const int32_t value = ((inst >> 17) & 0x01) ? -(131072 - (inst & 0x1FFFF)) : (inst & 0x1FFFF);

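I can measure a significant overhead when decoding value, and I am fairly sure it comes from the ternary operator (essentially an if statement) used to test the sign and perform the negation.

Is there a faster way to decode value?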

Answer 1

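Your expression is more complicated than it needs to be, especially in needlessly involving the ternary operator. The following expression computes the same results for all inputs without involving the ternary operator.^* It is a good candidate for a replacement, but as with any optimization problem, it must be tested: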

const int32_t value = (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);

Or this variation on @doynax's suggestion along similar lines might be friendlier to the optimizer:

const int32_t value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;

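In each case, the casts avoid implementation-defined behavior; on many architectures they will be no-ops as far as the machine code is concerned. On those architectures, these expressions involve fewer operations in all cases than yours does, not to mention being unconditional.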
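
Since the answer stresses that a replacement must be tested, here is a minimal sketch of an exhaustive check over all 2^18 field patterns. The function names and the branching reference decoder are illustrative assumptions, not part of the answer; the reference simply subtracts 2^18 when the sign bit is set.

```c
#include <stdint.h>

/* Illustrative reference decoder: plain two's complement definition. */
static int32_t decode_reference(uint32_t inst)
{
    uint32_t field = inst & 0x3FFFFu;        /* low 18 bits */
    if (field & 0x20000u)                    /* sign bit (bit 17) set */
        return (int32_t)field - 262144;      /* field - 2^18 */
    return (int32_t)field;
}

/* First branch-free expression from the answer. */
static int32_t decode_mask_sub(uint32_t inst)
{
    return (int32_t)(inst & 0x1FFFF) - (int32_t)(inst & 0x20000);
}

/* Second branch-free expression (the @doynax variation). */
static int32_t decode_xor_sub(uint32_t inst)
{
    return (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;
}

/* Counts the 18-bit patterns on which the decoders disagree. */
static uint32_t count_mismatches(void)
{
    uint32_t bad = 0;
    for (uint32_t field = 0; field <= 0x3FFFFu; ++field) {
        uint32_t inst = 0xFFFC0000u | field; /* upper 14 bits set on purpose */
        int32_t expected = decode_reference(inst);
        if (decode_mask_sub(inst) != expected || decode_xor_sub(inst) != expected)
            ++bad;
    }
    return bad;
}
```

A result of 0 from count_mismatches() means both expressions agree with the reference on every possible value field. It says nothing about speed, which still has to be measured on the target.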

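Competitive alternatives involving shifts may also optimize well, but all such alternatives necessarily rely on implementation-defined behavior, because of integer overflow in a left shift, a negative integer as the left-hand operand of a right shift, and/or conversion of an out-of-range value to a signed integer type. You will have to determine for yourself whether that constitutes a problem.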

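^* As compiled by GCC 4.4.7 for x86_64. The original expression invokes implementation-defined behavior for some inputs, so on other implementations the two expressions may compute different values for those inputs.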

Answer 2

The standard (though non-portable) practice is a left shift followed by an arithmetic right shift:

const int32_t temp = inst << 14; // "shift out" the 14 unneeded bits
const int32_t value = temp >> 14; // shift the number back; sign-extend

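This involves a conversion from uint32_t to int32_t and a right shift of a possibly negative int32_t; both operations are implementation-defined, i.e. not portable (they work on two's complement systems, and are pretty much guaranteed to work on any architecture). If you want the best performance and are willing to rely on implementation-defined behavior, you can use this code.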

As a single expression:

const int32_t value = (int32_t)(inst << 14) >> 14;

Note: the following looks cleaner and will usually work as well, but it involves undefined behavior (signed integer overflow):

const int32_t value = (int32_t)inst << 14 >> 14;

Don't use it! (Even though you probably won't get any warning or error about it.)
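
A minimal sketch of how the shift-based decode might be wrapped, together with a runtime probe for the arithmetic-shift assumption. Both function names are hypothetical. The probe covers only the right-shift behavior; the uint32_t to int32_t conversion is still assumed to reinterpret the bits.

```c
#include <stdint.h>

/* Shift-based sign extension of the low 18 bits.
   Assumptions: converting uint32_t to int32_t keeps the bit pattern, and
   >> on a negative value is an arithmetic shift (both are
   implementation-defined in C). */
static int32_t decode_value_shift(uint32_t inst)
{
    return (int32_t)(inst << 14) >> 14;   /* shift is done unsigned, then cast */
}

/* Hypothetical probe: returns 1 if >> on a negative int32_t sign-extends. */
static int right_shift_is_arithmetic(void)
{
    return ((int32_t)-8 >> 1) == -4;
}
```

On a compiler where the probe returns 1, decode_value_shift(0x00020000) gives -131072, the most negative 18-bit value.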

Answer 3

You might consider using bit-fields to simplify your code.

typedef struct inst_type {
#ifdef MY_MACHINE_NEEDS_THIS
    uint32_t opcode :  6;
    uint32_t r1     :  8;
    int32_t  value  : 18;
#else
    int32_t  value  : 18;
    uint32_t r1     :  8;
    uint32_t opcode :  6;
#endif
} inst_type;

const uint32_t opcode = inst.opcode;
const uint32_t r1 = inst.r1;
const int32_t value = inst.value;

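Direct bit manipulation often performs better, but not always. Using John Bollinger's answer as a baseline, the structure above results in one fewer instruction to extract the three values of interest on GCC (but fewer instructions does not necessarily mean faster).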
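
Because bit-field ordering is up to the compiler, it may be worth probing the layout before relying on it. The following is a sketch only: it assumes bit-fields are allocated starting from the least significant bit and that the struct is at least as large as a uint32_t. The names inst_fields, view_as_fields, and bitfield_layout_matches are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout: fields allocated from the least significant bit upward. */
typedef struct inst_fields {
    signed int   value  : 18;   /* explicitly signed so it sign-extends */
    unsigned int r1     :  8;
    unsigned int opcode :  6;
} inst_fields;

/* Hypothetical helper: copy the raw word into the bit-field struct. */
static inst_fields view_as_fields(uint32_t inst)
{
    inst_fields f;
    memset(&f, 0, sizeof f);
    memcpy(&f, &inst, sizeof inst < sizeof f ? sizeof inst : sizeof f);
    return f;
}

/* Returns 1 if the bit-field view agrees with mask-and-shift decoding
   for a probe word: opcode 42, r1 92, value -2. */
static int bitfield_layout_matches(void)
{
    const uint32_t probe = (0x2Au << 26) | (0x5Cu << 18) | 0x3FFFEu;
    inst_fields f = view_as_fields(probe);
    return f.opcode == 0x2Au && f.r1 == 0x5Cu && f.value == -2;
}
```

If bitfield_layout_matches() returns 0, the field order would need to be reversed (as the #ifdef in the answer does) or the mask-and-shift decoding used instead.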

Answer 4

For ideal compiler output with no implementation-defined or undefined behavior, use @doynax's 2's complement decoding expression:

value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;

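The casts ensure that we perform a signed subtraction, rather than an unsigned subtraction with wraparound followed by assigning that bit pattern to a signed integer.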

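This compiles to optimal asm on ARM, where gcc uses sbfx r1, r1, #0, #18 (signed bitfield-extract) to sign-extend bits [17:0] into a full int32_t register. On x86, it uses shl by 14 and sar by 14 (arithmetic shift) to do the same thing. This is a clear sign that gcc recognizes the 2's complement pattern and uses whatever is optimal on the target machine to sign-extend the bitfield.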
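
The xor-then-subtract pattern is not specific to 18 bits. As a sketch, it can be written for any field width from 1 to 31 bits; the function name sign_extend is illustrative and not from the answer. For the 18-bit field 0x3FFFF, the xor with 0x20000 gives 0x1FFFF = 131071, and subtracting 131072 gives -1.

```c
#include <stdint.h>

/* Sign-extends the low `bits` bits of x, for 1 <= bits <= 31, using the
   same xor-then-subtract pattern. The name is illustrative. */
static int32_t sign_extend(uint32_t x, unsigned bits)
{
    const uint32_t sign = UINT32_C(1) << (bits - 1);  /* the field's sign bit */
    const uint32_t mask = (sign << 1) - 1;            /* the whole field      */
    /* Flipping the sign bit turns the field into an offset-binary value in
       0 .. 2^bits - 1; the signed subtraction removes the 2^(bits-1) offset. */
    return (int32_t)((x & mask) ^ sign) - (int32_t)sign;
}
```

With bits limited to 31, the xor result always fits in int32_t, so the cast never converts an out-of-range value.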

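There is no portable way to make sure bitfields are ordered the way you want. gcc appears to order bitfields from LSB to MSB for little-endian targets, and from MSB to LSB for big-endian targets. You can use an #if to get identical asm output for ARM with and without -mbig-endian, just like the other methods, but there's no guarantee that other compilers work the same.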

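If gcc/clang did not see through the xor and sub, it would be worth considering the <<14 / >>14 implementation, which hand-holds the compiler towards doing it that way. Or consider the signed/unsigned bitfield approach with an #if.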

But since we can get ideal asm from gcc/clang with fully safe and portable code, we should just do that.

See the code on the Godbolt Compiler Explorer for versions from most of the answers. You can look at the asm output for x86, ARM, ARM64, or PowerPC.

// have to put the results somewhere, so the function doesn't optimize away
struct decode {
  //unsigned char opcode, r1;
  unsigned int opcode, r1;
  int32_t value;
};
// in real code you might return the struct by value, but there's less ABI variation when looking at the ASM this way (some would pack the struct into registers)

void decode_two_comp_doynax(struct decode *result, uint32_t inst) {
  result->opcode = ((inst >> 26) & 0x3F);
  result->r1 = (inst >> 18) & 0xFF;
  result->value = (int32_t)((inst & 0x3FFFF) ^ 0x20000) - (int32_t)0x20000;  // casts: signed subtraction, as in the expression above
}

# clang 3.7.1 -O3 -march=haswell   (enables BMI1 bextr)
    mov     eax, esi
    shr     eax, 26                     # grab the top 6 bits with a shift
    mov     dword ptr [rdi], eax
    mov     eax, 2066          # (0x812)# only AMD provides bextr r32, r32, imm.  Intel has to set up the constant separately
    bextr   eax, esi, eax               # extract the middle bitfield
    mov     dword ptr [rdi + 4], eax
    shl     esi, 14                     # <<14
    sar     esi, 14                     # >>14 (arithmetic shift)
    mov     dword ptr [rdi + 8], esi
    ret

Answer 5

const uint32_t opcode = ((inst >> 26) & 0x3F);
const uint32_t r1 = (inst >> 18) & 0xFF;
const uint32_t negative = ((inst >> 17) & 0x01);
const int32_t value =  -(negative * 131072 - (inst & 0x1FFFF));

When negative is 1, this evaluates to -(131072 - (inst & 0x1FFFF)); when it is 0, it evaluates to -(0 - (inst & 0x1FFFF)), which equals inst & 0x1FFFF.
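
As written, the expression is evaluated in unsigned arithmetic and then converted to int32_t, which is implementation-defined for negative results. Here is a sketch of the same formula in signed arithmetic; the function name is hypothetical. Note that negative * 131072 is the same quantity as inst & 0x20000, so this reduces to the mask-and-subtract form from Answer 1.

```c
#include <stdint.h>

/* Answer 5's formula, evaluated in signed arithmetic so that no
   out-of-range unsigned-to-signed conversion takes place. */
static int32_t decode_value_mul(uint32_t inst)
{
    const int32_t negative = (int32_t)((inst >> 17) & 0x01);  /* 0 or 1       */
    const int32_t low17    = (int32_t)(inst & 0x1FFFF);       /* 0 .. 131071  */
    return -(negative * 131072 - low17);
}
```

For example, a value field of 0x3FFFF gives negative = 1 and low17 = 131071, so the result is -(131072 - 131071) = -1.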

Optimizing a bit-decoding operation in C
