Question

When generating uniform random floats in [0, 1), TensorFlow currently uses bit twiddling to convert a 23-bit integer into a float in [1, 2), and then subtracts one:

// Helper function to convert a 32-bit integer to a float in [0, 1).
PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  // IEEE754 floats are formatted as follows (MSB first):
  //    sign(1) exponent(8) mantissa(23)
  // Conceptually construct the following:
  //    sign == 0
  //    exponent == 127  -- an excess 127 representation of a zero exponent
  //    mantissa == 23 random bits
  const uint32 man = x & 0x7fffffu;  // 23 bit mantissa
  const uint32 exp = static_cast<uint32>(127);
  const uint32 val = (exp << 23) | man;

  // Assumes that endian-ness is same for float and uint32.
  float result;
  memcpy(&result, &val, sizeof(val));
  return result - 1.0f;
}

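This annoys me, because subtracting one means we only get 23 bits of precision instead of the 24 that are available. Unfortunately, the naive algorithm is about 9% slower on CPU (it runs at the same speed on GPU):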

// Helper function to convert a 32-bit integer to a float in [0, 1).
PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  return 0x1p-32f * static_cast<float>(x);
}

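I also tried explicitly truncating to 24 bits, in case that would teach the compiler that the rounding-mode flags do not matter; this did not fix the problem: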

PHILOX_DEVICE_INLINE float Uint32ToFloat(uint32 x) {
  return 0x1p-24f * static_cast<float>(x & ((1 << 24) - 1));
}

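Is there a way to get the full 24 bits of available precision without sacrificing performance? I am fairly sure I could do this in assembly, but it needs to be portable.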

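Note that the remaining 8 bits of precision that are sometimes available for small floats do not matter: I only care about the one missing bit.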
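To make the missing bit concrete, here is a minimal sketch that restates two of the variants above with standard types. The names `BitTrickToFloat`, `MaskedMultiplyToFloat`, `kTwoToMinus23` and `kTwoToMinus24` are hypothetical, and `std::uint32_t` stands in for TensorFlow's `uint32`; this is an illustration, not TensorFlow's code.

```cpp
#include <cstdint>
#include <cstring>

// Exact powers of two, written as divisions so no hex-float support is needed.
constexpr float kTwoToMinus23 = 1.0f / 8388608.0f;   // 2^-23
constexpr float kTwoToMinus24 = 1.0f / 16777216.0f;  // 2^-24

// Bit-twiddling variant: builds a float in [1, 2), then subtracts 1.
// Every result is a multiple of 2^-23.
inline float BitTrickToFloat(std::uint32_t x) {
  const std::uint32_t val = (std::uint32_t{127} << 23) | (x & 0x7fffffu);
  float result;
  std::memcpy(&result, &val, sizeof(val));
  return result - 1.0f;
}

// Masked-multiply variant: keeps 24 bits, so the integer-to-float
// conversion is exact and every result is a multiple of 2^-24.
inline float MaskedMultiplyToFloat(std::uint32_t x) {
  return kTwoToMinus24 * static_cast<float>(x & ((1u << 24) - 1));
}
```

The smallest nonzero output of the first function is 2^-23, while the second reaches 2^-24, which is the extra bit in question.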

Answer 1

You could try skipping the subtraction when the 24th bit is set:

  …
  const uint32 exp = static_cast<uint32>(126); // 0.5
  …
  if ((x & 0x800000) == 0) result -= 0.5f;
  return result;
}

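That said, a 9% penalty for the 24th bit is already quite good, and this will not necessarily be faster. (Here you sometimes avoid the cost of the subtraction, but you always pay for the test and the conditional branch. I will leave the timing to you: the 0x800000 mask can be computed in parallel with the rest, but the cost of the conditional branch depends entirely on the distribution of values in practice.)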

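It would be easy to make this branchless for the GPU by always doing the subtraction and then using a conditional move, but the compiler should do that automatically.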
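As a minimal sketch of the branchless form, the following assumes the elided lines are the same as in the question's function. The name `Uint32ToFloat24` is hypothetical and `std::uint32_t` replaces `uint32`; whether the select compiles to a conditional move depends on the compiler and target.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical 24-bit variant: bits 0..22 form the mantissa of a float in
// [0.5, 1), and bit 23 selects the upper or lower half of [0, 1).
inline float Uint32ToFloat24(std::uint32_t x) {
  const std::uint32_t man = x & 0x7fffffu;              // 23 bit mantissa
  const std::uint32_t val = (std::uint32_t{126} << 23) | man;  // in [0.5, 1)

  float result;
  std::memcpy(&result, &val, sizeof(val));

  // Always subtract, but select the amount instead of branching:
  // bit 23 set   -> keep the value in [0.5, 1)
  // bit 23 clear -> shift it down to [0, 0.5)
  const float offset = (x & 0x800000u) ? 0.0f : 0.5f;
  return result - offset;
}
```

Both halves use the spacing 2^-24, so the 2^24 possible inputs map to 2^24 distinct floats.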

Answer 2

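You can use __builtin_clz to adjust the exponent directly and map the rest of the number into the mantissa, which avoids both the floating-point subtraction and the loss of precision: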

#include <stdint.h>
#include <string.h>

float Uint32ToFloat(uint32_t x) {
  // IEEE754 floats are formatted as follows (MSB first):
  //    sign(1) exponent(8) mantissa(23)
  // Conceptually construct the following:
  //    sign == 0
  //    exponent == 126  -- an excess 127 representation of a -1 exponent
  //    mantissa == 23 random bits
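  if (x == 0) return 0.0f;  // __builtin_clz(0) is undefined.
  uint32_t exp = static_cast<uint32_t>(126);

  auto lz = __builtin_clz(x);
  exp -= lz;
  // Shift in two steps: lz + 1 is 32 when x == 1, and shifting a 32-bit
  // value by 32 is undefined. The extra shift by 1 chops off the implicit 1
  // in the FP representation.
  x = (x << lz) << 1;
  const uint32_t man = x >> 9;  // 23 bit mantissa.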
  const uint32_t val = (exp << 23) | man;

  // Assumes that endian-ness is same for float and uint32.
  float result;
  memcpy(&result, &val, sizeof(val));
  return result;
}

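Note that the CUDA equivalent of gcc's __builtin_clz is __clz().

Pros: keeps as much of the original random number's precision as possible.

Cons: I think the original version vectorizes better and has lower instruction latency.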
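Since `__builtin_clz` is specific to GCC-compatible compilers and the question asks for portability, a fallback may be useful. The sketch below is one common way to count leading zeros with a binary search over the bits; the name `CountLeadingZeros32` is hypothetical, and it is likely slower than a hardware instruction.

```cpp
#include <cstdint>

// Hypothetical portable fallback for a 32-bit count-leading-zeros.
// Unlike __builtin_clz, it is defined for x == 0 (returns 32).
inline int CountLeadingZeros32(std::uint32_t x) {
  if (x == 0) return 32;
  int n = 0;
  // Each step checks whether the top half of the remaining window is empty
  // and, if so, shifts the value up and adds the window size to the count.
  if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
  if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
  if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
  if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
  if ((x & 0x80000000u) == 0) { n += 1; }
  return n;
}
```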

A third approach is to let the FP hardware do the integer-to-float conversion and then adjust the exponent directly:

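inline float Uint32ToFloat_bit(uint32_t x) {
  if (x == 0) return 0.0f;  // Otherwise the exponent subtraction below wraps.
  float f(x);  // Rounds (to nearest by default), so inputs near 2^32 become 2^32.
  uint32_t f_as_int;
  memcpy(&f_as_int, &f, sizeof(f_as_int));
  f_as_int -= (32 << 23);  // Subtract 32 from the exponent.
  float result;
  memcpy(&result, &f_as_int, sizeof(f_as_int));
  return result;  // Can be exactly 1.0f when f was rounded up to 2^32.
}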

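For me this is faster than the builtin_clz version but slower than your baseline, though again I suspect it depends on context. This one vectorizes nicely - but so does your baseline, since it is just a vmulss.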
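One caveat with converting all 32 bits is that the conversion rounds, so inputs close to 2^32 can come out as exactly 1.0f. A minimal sketch of a variant that keeps only the top 24 bits, which makes the conversion exact, is shown below; the name `Uint32ToFloatTop24` is hypothetical, and it gives up the extra precision for small values that the clz version preserves.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical variant of the exponent-adjustment approach that uses only
// the top 24 bits of x, so the result is always in [0, 1).
inline float Uint32ToFloatTop24(std::uint32_t x) {
  const std::uint32_t top24 = x >> 8;
  if (top24 == 0) return 0.0f;  // Zero has no exponent to adjust.

  // Any integer below 2^24 is exactly representable as a float.
  const float f = static_cast<float>(top24);

  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  bits -= (std::uint32_t{24} << 23);  // Subtract 24 from the exponent: f * 2^-24.

  float result;
  std::memcpy(&result, &bits, sizeof(bits));
  return result;
}
```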

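After all of this, I think the best next step is to build a timing harness that generates batches of 8 random numbers, converts them in bulk so that the compiler can vectorize the conversion, and then see which variant is best.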
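A minimal sketch of such a harness follows. All names here (`kBatch`, `BaselineToFloat`, `ConvertBatch`, `NextRandom`, `TimingResult`, `TimeConversion`) are hypothetical, and the generator is a simple xorshift stand-in, not Philox. The measured time includes generating the inputs, so treat the numbers as relative, not absolute.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kBatch = 8;

// Baseline 23-bit conversion from the question, used as one candidate.
inline float BaselineToFloat(std::uint32_t x) {
  const std::uint32_t val = (std::uint32_t{127} << 23) | (x & 0x7fffffu);
  float result;
  std::memcpy(&result, &val, sizeof(val));
  return result - 1.0f;
}

// Converts one batch in a plain loop that the compiler is free to vectorize.
template <typename Convert>
inline void ConvertBatch(Convert convert, const std::uint32_t* in, float* out) {
  for (std::size_t i = 0; i < kBatch; ++i) out[i] = convert(in[i]);
}

// Stand-in generator (xorshift32); it only feeds the benchmark.
inline std::uint32_t NextRandom(std::uint32_t& state) {
  state ^= state << 13;
  state ^= state >> 17;
  state ^= state << 5;
  return state;
}

struct TimingResult {
  double seconds;  // Wall-clock time for all batches.
  float checksum;  // Sum of outputs, kept so the work is not optimized away.
};

template <typename Convert>
TimingResult TimeConversion(Convert convert, std::size_t num_batches) {
  std::uint32_t state = 0x12345678u;
  std::uint32_t in[kBatch];
  float out[kBatch];
  float checksum = 0.0f;

  const auto start = std::chrono::steady_clock::now();
  for (std::size_t b = 0; b < num_batches; ++b) {
    for (std::uint32_t& v : in) v = NextRandom(state);
    ConvertBatch(convert, in, out);
    for (float v : out) checksum += v;
  }
  const auto stop = std::chrono::steady_clock::now();

  return {std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Calling `TimeConversion` once per candidate with the same batch count, for example `TimeConversion(BaselineToFloat, 1000000)`, gives comparable timings; printing the checksum afterwards keeps the compiler from discarding the loop.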

Fast, portable conversion of 24-bit integers to floats without loss
