Question

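Are there any known optimizations for multiplying together several (3 to 5) bytes (int8) that are all known to be of the form 2^x - 1 (1, 3, 7, ...)?

This is in the context of repeatedly multiplying an array of bytes by (2^x - 1) / 2^x. The division is trivial (add up the exponents for a right shift), but the numerator is a bit of a pain.

In addition, each exponent x is only ever in 1..31, and the sum of the exponents is always less than 32.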

// In reality there are 16 of these (i.e. a[16], b[16], c[16])
// ( a + b + c ) < 32
char  a = 2;
char  b = 16;
char  c = 8;

// Ratio/scale, there are 16 of these (i.e. r[16])
// It might work storing in log2 and using int8 or int16
// with fixed point approximation
<x?>  r = ( a - 1 ) * ( b - 1 ) * ( c - 1 ) / ( a * b * c );

// Big original value, just one
int   v = 1234567890;
// This might be done by scaling down to log2, too
// it is used for a comparison only
// doesn't need full 32b precision
// This is also 16 values, of course (i.e. rv[16])
int  rv = v * r;

Answer 1

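Frankly, this function is a poor fit for the AVX instruction set, which lacks integer arithmetic. A straightforward integer left shift, as provided by SSE2 or AVX2, is almost certainly the fastest approach. However, judging by your reply to Aleksander Z's answer, I gather that you want to evaluate alternative approaches.

Forcing this problem onto the AVX unit requires us to get creative with the IEEE-754 representation of numbers. Using unaligned loads and bitwise masking, we can shuffle the individual byte values into the topmost byte of a 32-bit float, where the exponent defines the 2^n power of the number.

This almost gives us the required power function, except that we miss the least significant bit of the exponent field and need a square root to compensate. Similarly, we have to remove the exponent bias with a multiplication.

Anyway, see the code below for the details, as there is little point in repeating the comments verbatim here. Note that up to three bytes before the start of the array are read (but ignored) by the unaligned loads, so add padding as needed. Also note that the result words are interleaved, with result1 holding bytes {0, 4, 8, 12, ...} and so on.

Oh, and obviously the results will be approximate, since floating-point arithmetic is used.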

#include <immintrin.h>
#include <stddef.h>

void compute(const unsigned char (*ptr)[32], size_t len) {
    const __m256 mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x3F000000U));
    const __m256 normalize = _mm256_castsi256_ps(_mm256_set1_epi32(0x7F000000U));
    const __m256 offset = _mm256_set1_ps(1);

    __m256 result1 = _mm256_set1_ps(1);
    __m256 result2 = _mm256_set1_ps(1);
    __m256 result3 = _mm256_set1_ps(1);
    __m256 result4 = _mm256_set1_ps(1);

    do {
        // Mask out every fourth byte into a separate variable using unaligned
        // loads to simulate 8-to-32 bit integer unpacking
        __m256 real1 = _mm256_loadu_ps((const float *) &ptr[0][-3]);
        __m256 real2 = _mm256_loadu_ps((const float *) &ptr[0][-2]);
        __m256 real3 = _mm256_loadu_ps((const float *) &ptr[0][-1]);
        __m256 real4 = _mm256_loadu_ps((const float *) &ptr[0][-0]);

        real1 = _mm256_and_ps(real1, mask);
        real2 = _mm256_and_ps(real2, mask);
        real3 = _mm256_and_ps(real3, mask);
        real4 = _mm256_and_ps(real4, mask);

        // Once the masked top bytes are interpreted as IEEE-754
        // floating-point exponent bits, the values are 2^2x * 2^-BIAS.
        // Unfortunately we are overshooting the exponent field by one bit,
        // hence the doubled exponents. Anyway, let's at least multiply the
        // bias away
        real1 = _mm256_mul_ps(real1, normalize);
        real2 = _mm256_mul_ps(real2, normalize);
        real3 = _mm256_mul_ps(real3, normalize);
        real4 = _mm256_mul_ps(real4, normalize);

        // Use a fast approximate reciprocal square root to halve the exponent,
        // yielding ~1/2^x.
        // You'd think this case of the reciprocal lookup table would be
        // precise, yet it seems not to be. Perhaps twiddling the rounding
        // mode or biasing the values may make it so.
        real1 = _mm256_rsqrt_ps(real1);
        real2 = _mm256_rsqrt_ps(real2);
        real3 = _mm256_rsqrt_ps(real3);
        real4 = _mm256_rsqrt_ps(real4);

        // Compute (2^x-1)/2^x as 1-1/2^x
        real1 = _mm256_sub_ps(offset, real1);
        real2 = _mm256_sub_ps(offset, real2);
        real3 = _mm256_sub_ps(offset, real3);
        real4 = _mm256_sub_ps(offset, real4);

        // Finally multiply the running products
        result1 = _mm256_mul_ps(result1, real1);
        result2 = _mm256_mul_ps(result2, real2);
        result3 = _mm256_mul_ps(result3, real3);
        result4 = _mm256_mul_ps(result4, real4);
    } while(++ptr, --len);

    /*
     * Do something useful with result1..4 here
     */
}
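
To make the bit layout concrete, here is a scalar sketch of only the masking and bias-removal steps described above. The helper names `bits_to_float` and `pow4_from_top_byte` are made up for this illustration, and the reciprocal square root step is deliberately left out, so the function returns 2^(2x) rather than 1/2^x.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (memcpy avoids aliasing problems). */
static float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: scalar equivalent of the mask + normalize steps. */
static float pow4_from_top_byte(uint8_t x) {
    uint32_t word = (uint32_t)x << 24;  /* byte lands in the topmost byte      */
    word &= 0x3F000000u;                /* same mask as the vector code        */
    /* The exponent field starts at bit 23, so it now holds 2x, not x:
     * the float value is 2^(2x - 127). */
    float biased = bits_to_float(word);
    /* 0x7F000000 has exponent field 254, i.e. the value 2^127. */
    float normalize = bits_to_float(0x7F000000u);
    return biased * normalize;          /* 2^(2x), exact for x in 1..31        */
}
```

For x = 3 this yields 64, which is 2^(2*3); halving the exponent (the job of the reciprocal square root in the vector code) is what turns it into the wanted power of two.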

Answer 2

Isn't it as simple as the following?

a * (2^x - 1) = (a << x) - a
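
A minimal sketch of that identity in C, assuming unsigned 32-bit values and the question's constraint that the exponents sum to less than 32 (so nothing overflows). The function names are hypothetical.

```c
#include <stdint.h>

/* v * (2^x - 1) computed as (v << x) - v. Assumes 1 <= x <= 31. */
static uint32_t mul_pow2_minus_1(uint32_t v, unsigned x) {
    return (v << x) - v;
}

/* Product of (2^x[i] - 1) over n exponents, one shift and one subtract each.
 * Assumes the exponents sum to less than 32. */
static uint32_t product_pow2_minus_1(const uint8_t *x, int n) {
    uint32_t p = 1;
    for (int i = 0; i < n; ++i)
        p = mul_pow2_minus_1(p, x[i]);
    return p;
}
```

With exponents {1, 4, 3} this computes 1 * 15 * 7 = 105 using three shifts and three subtractions, with no multiply instruction.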

Answer 3

What I see is this (somewhat contrary to your last calculation):

(2^a-1)(2^b-1)(2^c-1)=2^(a+b+c)-2^(a+b)-2^(b+c)-2^(a+c)
                              + 2^a + 2^b + 2^c - 1

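Note that all terms in the expansion are powers of 2, and all exponents are < 32 by your constraint. Of course, all 32 of these possible terms can be "precomputed". It is then just a matter of summing 2^j such terms (with 3 <= j <= 5 by your constraint). By my count, for the case j = 3 (a, b, c) that is 7 "lookups" and 7 additions of terms. I don't know whether that gains you anything over simply doing 3 "lookups" of (2^x-1) and 2 multiplications (biting the bullet)...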
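
The expansion can be sketched as a signed sum over all subsets of the exponents. This is an illustration only: `expand_product` is a hypothetical name, and a plain shift stands in for the "precomputed" table of powers of two.

```c
#include <stdint.h>

/* Evaluate prod(2^x[i] - 1) via its expansion into signed powers of two.
 * Assumes n <= 5 and that the exponents sum to less than 32. */
static uint32_t expand_product(const uint8_t *x, int n) {
    uint32_t sum = 0;
    for (unsigned subset = 0; subset < (1u << n); ++subset) {
        unsigned exponent = 0;   /* sum of the exponents picked by this subset */
        unsigned picked = 0;     /* how many exponents were picked             */
        for (int i = 0; i < n; ++i) {
            if (subset & (1u << i)) {
                exponent += x[i];
                ++picked;
            }
        }
        uint32_t term = (uint32_t)1 << exponent;  /* the "lookup" of 2^exponent */
        /* Each factor left out contributes a -1, so the sign is (-1)^(n - picked). */
        if (((unsigned)n - picked) & 1u)
            sum -= term;
        else
            sum += term;
    }
    return sum;
}
```

For exponents {1, 4, 3} the eight terms are 256 - 32 - 16 - 128 + 2 + 16 + 8 - 1 = 105, matching 1 * 15 * 7.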

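Also note: you can multiply anything by a factor of 2^y-1 with (y-1) shifts and (y-1) additions. Assuming the exponents are a, b, c, d, e and a is the largest, that comes to (b+c+d+e-4) shifts and (b+c+d+e-4) additions (starting from 2^a-1).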
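
A sketch of that shift-and-add multiplication, relying on 2^y - 1 = 1 + 2 + 4 + ... + 2^(y-1). The function name is hypothetical, and the result is assumed to fit in 32 bits.

```c
#include <stdint.h>

/* n * (2^y - 1) using (y - 1) shifts and (y - 1) additions. Assumes y >= 1. */
static uint32_t mul_shift_add(uint32_t n, unsigned y) {
    uint32_t acc = n;       /* the 2^0 term          */
    uint32_t shifted = n;
    for (unsigned k = 1; k < y; ++k) {
        shifted <<= 1;      /* n * 2^k               */
        acc += shifted;     /* accumulate that term  */
    }
    return acc;
}
```

For example, `mul_shift_add(5, 3)` adds 5 + 10 + 20 to give 35, which is 5 * 7.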

Answer 4

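Have you considered using a simple precomputed lookup table? If I understand your question correctly, x0, x1 and x2 are always between 1 and 31 and can each be stored in 5 bits, so there are only 2^15 = 32768 combinations. That means a couple of bit shifts and bitwise ORs are enough to compute the index, followed by a single lookup in a fairly small table.

Of course, this table lookup cannot be vectorized.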
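
A minimal sketch of such a table for three exponents, assuming 32-bit entries (128 KiB in total) and leaving entries that violate the question's constraints at zero. All names here are hypothetical.

```c
#include <stdint.h>

/* One entry per (x0, x1, x2) combination, 5 bits per exponent. */
static uint32_t product_table[1u << 15];

/* Pack three 5-bit exponents into a 15-bit index with shifts and ORs. */
static unsigned table_index(unsigned x0, unsigned x1, unsigned x2) {
    return x0 | (x1 << 5) | (x2 << 10);
}

/* Fill the table once. Combinations with a zero exponent or with
 * x0 + x1 + x2 >= 32 are outside the stated constraints and stay 0. */
static void build_product_table(void) {
    for (unsigned x0 = 1; x0 < 32; ++x0)
        for (unsigned x1 = 1; x1 < 32; ++x1)
            for (unsigned x2 = 1; x2 < 32; ++x2)
                if (x0 + x1 + x2 < 32)
                    product_table[table_index(x0, x1, x2)] =
                        ((1u << x0) - 1) * ((1u << x1) - 1) * ((1u << x2) - 1);
}

/* The whole product then costs two shifts, two ORs and one load. */
static uint32_t lookup_product(unsigned x0, unsigned x1, unsigned x2) {
    return product_table[table_index(x0, x1, x2)];
}
```

After calling `build_product_table()` once, `lookup_product(1, 4, 3)` returns 1 * 15 * 7 = 105. Extending this to four or five exponents would need 20 or 25 index bits, which makes the table considerably larger.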

Optimizing multiplication of elements of the form 2^x - 1
