Decimal concatenation of two integers using bit operations

Question

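I want to concatenate two integers using only bit operations, because I need as much efficiency as possible. There are various answers available, but they are not fast enough. What I want is an implementation that uses only bit operations such as left shift. Please guide me on how to do it.

For example: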

int x=32;
int y=12;
int result=3212;

I am implementing AES on an FPGA. I need this on my system to reduce the time spent on certain tasks.

Answer 1

The most efficient way is probably something like this:

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}

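Then let the compiler worry about optimization. There probably isn't much to improve, because this is base 10, which doesn't map well onto the computer's instructions such as shifts.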
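To make the loop concrete, here is a small trace sketch using the question's values. The function body is the one from this answer; `question_example` is a hypothetical helper added only for illustration.

```c
#include <stdint.h>

/* Same loop as in the answer: find the smallest power of 10
   that is greater than ls, then scale ms by it. */
static uint32_t uintcat(uint32_t ms, uint32_t ls)
{
  uint32_t mult = 1;

  do
  {
    mult *= 10;
  } while (mult <= ls);

  return ms * mult + ls;
}

/* Hypothetical helper: the example from the question.
   For ls = 12 the loop stops at mult = 100, giving 32*100 + 12. */
static uint32_t question_example(void)
{
  return uintcat(32, 12);
}
```

Note that `ls = 0` is treated as a single digit, so `uintcat(7, 0)` gives 70.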

EDIT: Benchmarking

Intel i7-3770 @ 3.4 GHz
OS: Windows 7/64
Mingw, GCC version 4.6.2
gcc -O3 -std=c99 -pedantic-errors -Wall

10 million random values, from 0 to 3276732767.

Results (approximate):

Algorithm 1: 60287 micro seconds
Algorithm 2: 65185 micro seconds

Benchmark code used:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand, srand
#include <windows.h>
#include <time.h>

uint32_t uintcat (uint32_t ms, uint32_t ls)
{
  uint32_t mult=1;

  do
  {
    mult *= 10; 
  } while(mult <= ls);

  return ms * mult + ls;
}


uint32_t myConcat (uint32_t a, uint32_t b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;

        default: return a*10+b; break;
    }
}


static LARGE_INTEGER freq;

static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end)
{
  LARGE_INTEGER elapsed;

  elapsed.QuadPart = end->QuadPart - start->QuadPart;
  elapsed.QuadPart *= 1000000;
  elapsed.QuadPart /= freq.QuadPart;

  printf("%lld micro seconds", (long long)elapsed.QuadPart);  // QuadPart is 64-bit
}

int main()
{
  const uint32_t TEST_N = 10000000;
  uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N);
  uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N);
  volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N);

  srand (time(NULL));
  // Mingw rand() apparently gives numbers up to 32767
  // worst case should therefore be 3,276,732,767

  // fill up random data in arrays
  for(uint32_t i=0; i<TEST_N; i++)
  {
    data1[i] = rand();
    data2[i] = rand();
  }


  QueryPerformanceFrequency(&freq); 


  LARGE_INTEGER start, end;

  // run algorithm 1
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo1[i] = uintcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 1: ");
  print_benchmark_results(&start, &end);
  printf("\n");

  // run algorithm 2
  QueryPerformanceCounter(&start);
  for(uint32_t i=0; i<TEST_N; i++)
  {
    result_algo2[i] = myConcat(data1[i], data2[i]);
  } 
  QueryPerformanceCounter(&end);

  // print results
  printf("Algorithm 2: ");
  print_benchmark_results(&start, &end);
  printf("\n\n");


  // sanity check both algorithms against each other
  for(uint32_t i=0; i<TEST_N; i++)
  {
    if(result_algo1[i] != result_algo2[i])
    {
      printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n",
             data1[i], 
             data2[i],
             data1[i],
             data2[i],
             result_algo1[i],
             result_algo2[i]);
    }
  }


  // clean up
  free((void*)data1);
  free((void*)data2);
  free((void*)result_algo1);
  free((void*)result_algo2);
}

Answer 2

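Bit operations work on the binary representation of a number. What you are trying to do, however, is concatenate numbers in decimal notation. Note that concatenating decimal representations has almost nothing to do with concatenating binary representations. While it is theoretically possible to solve the problem with binary operations, I believe it is far from the most efficient way.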
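A minimal sketch of the difference, using the question's values. Both function names are hypothetical, and the 4-bit field width in the binary version is an assumption chosen for this illustration.

```c
#include <stdint.h>

/* Binary concatenation: x goes into the bits above a 4-bit field holding y.
   Assumes y < 16 (illustrative assumption). */
static uint32_t concat_binary4(uint32_t x, uint32_t y)
{
  return (x << 4) | y;
}

/* Decimal concatenation: the scale factor is a power of 10, not of 2,
   so it cannot be applied with a single shift. */
static uint32_t concat_decimal(uint32_t x, uint32_t y)
{
  uint32_t mult = 10;
  while (mult <= y)
    mult *= 10;
  return x * mult + y;
}
```

With x = 32 and y = 12, the binary version yields 524 (binary 10 0000 1100), while the decimal version yields 3212.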

Answer 3

We need to compute a * 10^N + b very quickly.

Bit operations are not the best choice for optimizing this (not even tricks such as a := (a << 1) + (a << 3) <=> a := a * 10, because the compiler can do that by itself).
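For reference, the shift-and-add identity for multiplying by 10 can be sketched like this (function names are hypothetical). An optimizing compiler is generally free to pick whichever form is cheaper on the target, which is why writing it by hand rarely helps.

```c
#include <stdint.h>

/* a*10 written as 2a + 8a using shifts. */
static uint32_t times10_shift(uint32_t a)
{
  return (a << 1) + (a << 3);
}

/* The plain form; unsigned arithmetic wraps modulo 2^32 in both versions,
   so the two functions agree for every input. */
static uint32_t times10_mul(uint32_t a)
{
  return a * 10u;
}
```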

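The first problem is computing 10^N, but there is no need to compute it: there are only 9 possible values.

The second problem is computing N from b (the length of its base-10 representation). If your data is uniformly distributed, you can minimize the operation count for the average case.

Check b <= 10^9, b <= 10^8, ..., b <= 10 with ()?: (after optimization it is faster than if(), and it has simpler syntax and functionality), and call the result N. Next, write a switch(N) whose lines are "return a * 10^N + b" (where 10^N is a constant). As far as I know, a switch() with 3-4 "case" labels is faster than the equivalent if() construct after optimization.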

unsigned int myConcat(unsigned int& a, unsigned int& b) {
    switch( (b >= 10000000) ? 7 : 
            (b >= 1000000) ? 6 : 
            (b >= 100000) ? 5 : 
            (b >= 10000) ? 4 : 
            (b >= 1000) ? 3 : 
            (b >= 100) ? 2 : 
            (b >= 10) ? 1 : 0 ) {
        case 1: return a*100+b; break;
        case 2: return a*1000+b; break;
        case 3: return a*10000+b; break;
        case 4: return a*100000+b; break;
        case 5: return a*1000000+b; break;
        case 6: return a*10000000+b; break;
        case 7: return a*100000000+b; break;
        default: return a*10+b; break;
        // I don't really know what to do here
        //case 8: return a*1000*1000*1000+b; break;
        //case 9: return a*10*1000*1000*1000+b; break;
    }
}

As you can see, there are 2-3 operations in the average case, and optimization is very effective here. I benchmarked it against Lundin's suggestion; here is the result: 0ms vs 100ms.

Answer 4

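If you care about decimal digit concatenation, you may want to simply do it at print time, converting both numbers to digit sequences one after the other. For example, "How do I print an integer in Assembly Level Programming without printf from the c library?" shows an efficient C function as well as asm. Call it twice into the same buffer.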
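As a rough sketch of the "convert twice into the same buffer" idea, the following uses the standard snprintf as a stand-in for the hand-written conversion function from the linked answer. The name `cat_digits` is hypothetical, and snprintf is an assumption made for brevity, not the faster routine the link describes.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Write the decimal digits of ms, then those of ls, into buf.
   Returns the total number of digits written, or -1 if buf is too small. */
static int cat_digits(char *buf, size_t size, uint32_t ms, uint32_t ls)
{
  int n1 = snprintf(buf, size, "%" PRIu32, ms);
  if (n1 < 0 || (size_t)n1 >= size)
    return -1;

  /* Second conversion starts right where the first one ended. */
  int n2 = snprintf(buf + n1, size - (size_t)n1, "%" PRIu32, ls);
  if (n2 < 0 || (size_t)n2 >= size - (size_t)n1)
    return -1;

  return n1 + n2;
}
```

Because the result is a string, it cannot overflow the way a 32-bit integer result can: two 10-digit inputs simply produce 20 characters.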

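@Lundin's answer loops over increasing powers of 10 to find the right decimal shift, i.e. it is a linear search for the right power of 10. If it is called often enough that a lookup table can stay hot in cache, a speedup may be possible.

If you can use GNU C __builtin_clz (count leading zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table. (And you only have to check at most one more iteration, so it is not a loop.)

Most common modern CPU architectures have a HW instruction that the compiler can use, directly or with a little extra processing, to implement clz (see https://en.wikipedia.org/wiki/Find_first_set#Hardware_support). (On all of them except x86, the result is well-defined for an input of 0, but unfortunately GNU C does not portably give us access to that.)

If the table stays hot in L1d cache, this can work well. The extra latency of a clz and a table lookup is comparable to a couple of iterations of the loop (for example on a modern x86 like Skylake or Ryzen, where bsf or tzcnt has 3 cycle latency, L1d latency is 4 or 5 cycles, and imul latency is 3 cycles).

Of course, on many architectures (including x86), multiplying by 10 is cheaper than multiplying by a runtime variable, using shift and add: 2 LEA instructions on x86, or add + lsl on ARM / AArch64, using a shifted input to the add to do tmp = x + x*4. So on Intel CPUs we are only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.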

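This doesn't sound great for small numbers. But needing at most one iteration can reduce branch mispredictions, and it even allows a branchless implementation. It also means less total work for large low parts (large powers of 10). But large integers will easily overflow, unless you use a wider result type.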
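A minimal sketch of the "wider result type" option, assuming 32-bit inputs and a 64-bit result. The name `uintcat64` is hypothetical, and the simple loop is used here only to keep the example short.

```c
#include <stdint.h>

/* Concatenate two 32-bit values into a 64-bit result.
   mult is 64-bit, so it can reach 10^10 without wrapping.
   Caveat: two 10-digit inputs can still exceed 64 bits when ms is
   larger than about 1.8 * 10^9, since UINT64_MAX is roughly 1.8 * 10^19. */
static uint64_t uintcat64(uint32_t ms, uint32_t ls)
{
  uint64_t mult = 10;
  while (mult <= ls)
    mult *= 10;
  return (uint64_t)ms * mult + ls;
}
```

For example, `uintcat64(123456, 7890123)` gives 1234567890123, which would not fit in a uint32_t.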

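Unfortunately, 10 is not a power of 2, so the MSB position alone cannot give us the exact power of 10. For example, all numbers from 64 to 127 have MSB = 1<<6, but some of them have 2 decimal digits and others have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and a shift of the high half), we always want to start from the lower power of 10 and check whether it is big enough.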
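The ambiguity can be checked with a short sketch. Both helper names are hypothetical; `__builtin_clz` is the GNU C builtin, and the code assumes `unsigned` is 32 bits wide.

```c
#include <stdint.h>

/* Index of the most significant set bit (0..31).
   __builtin_clz is undefined for 0, so set the low bit first. */
static unsigned msb_index(uint32_t x)
{
  return 31u - (unsigned)__builtin_clz(x | 1);
}

/* Straightforward digit count by repeated division, for comparison only. */
static unsigned decimal_digits(uint32_t x)
{
  unsigned n = 1;
  while (x >= 10)
  {
    x /= 10;
    n++;
  }
  return n;
}
```

Here 64, 99, 100 and 127 all share the same `msb_index`, yet `decimal_digits` returns 2 for the first two and 3 for the last two, so one extra comparison is still needed after the bit scan.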

Fortunately, though, a bit scan does get us to within one power of 10, so we no longer need a loop.
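A compact sketch of the whole idea, under a few assumptions: the table is filled at run time rather than written out by hand, it is indexed by MSB position, and the names `pow10_by_msb`, `init_pow10_by_msb` and `uintcat_tbl` are hypothetical. It uses GNU C `__builtin_clz` and assumes a 32-bit `unsigned`.

```c
#include <stdint.h>

static uint32_t pow10_by_msb[32];

/* Entry i holds the smallest power of 10 greater than 2^i, i.e. the
   multiplier that suits the low end of the range 2^i .. 2^(i+1)-1.
   Entries 30 and 31 would need 10^10, which does not fit in 32 bits;
   the cast truncates them, but a concatenation with such a 10-digit
   low part would overflow uint32_t anyway. */
static void init_pow10_by_msb(void)
{
  uint64_t mult = 10;
  for (unsigned i = 0; i < 32; i++)
  {
    while (mult <= ((uint64_t)1 << i))
      mult *= 10;
    pow10_by_msb[i] = (uint32_t)mult;
  }
}

/* Call init_pow10_by_msb() once before using this. */
static uint32_t uintcat_tbl(uint32_t ms, uint32_t ls)
{
  unsigned msb = 31u - (unsigned)__builtin_clz(ls | 1); /* |1 avoids clz(0) */
  uint32_t mult = pow10_by_msb[msb];
  if (mult <= ls)        /* at most one correction step, no loop */
    mult *= 10;
  return ms * mult + ls;
}
```

For ls = 12 the table gives 10, the single check bumps it to 100, and `uintcat_tbl(32, 12)` returns 3212.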

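I probably wouldn't have written the part with _lzcnt_u32 or ARM __clz if I had known about the clz(a|1) trick for avoiding the input = 0 problem beforehand. But I did, and I played around with the source to try to get nicer asm from gcc and clang. Indexing the table on clz or BSR depending on the target platform makes it a bit of a mess.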

#include <stdint.h>
#include <limits.h>
#include <assert.h>

   // builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0
   // actual x86 HW leaves the destination register unmodified; AMD even documents this.
   // but GNU C doesn't let us take advantage with intrinsics.
   // unless you use BMI1 _lzcnt_u32


// if available, use an intrinsic that gives us a leading-zero count
// *without* an undefined result for input=0
#ifdef __LZCNT__      // x86 CPU feature
#include <immintrin.h>  // Intel's intrinsics
#define HAVE_LZCNT32
#define lzcnt32(a) _lzcnt_u32(a)
#endif

#ifdef __ARM__      // TODO: do older ARMs not have this?
#define HAVE_LZCNT32
#define lzcnt32(a) __clz(a)  // builtin, no header needed
#endif

// Some POWER compilers define `__cntlzw`?



// index = msb position, or lzcnt, depending on which the HW can do more efficiently
// defined later; one or the other is unused and optimized out, depending on target platform
// alternative: fill this at run-time startup
// with a loop that does mult*=10 when (x<<1)-1 > mult, or something
//#if INDEX_BY_MSB_POS == 1
  __attribute__((unused))
  static const uint32_t catpower_msb[] = {
       10,    // 1 and 0
       10,    // 2..3
       10,    // 4..7
       10,    // 8..15
       100,    // 16..31     // 2 digits even for the low end of the range
       100,    // 32..63
       100,    // 64..127
       1000,   // 128..255   // 3 digits
       1000,   // 256..511
       1000,   // 512..1023
       10000,   // 1024..2047
       10000,   // 2048..4095
       10000,   // 4096..8191
       10000,   // 8192..16383
       100000,   // 16384..32767
       100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       //  ...   // fill in the rest yourself
  };
//#elif INDEX_BY_MSB_POS == 0
  // index on leading zeros
  __attribute__((unused))
  static const uint32_t catpower_lz32[] = {
      // top entries overflow: 10^10 doesn't fit in uint32_t
      // intentionally wrong to make it easier to spot bad output.
    4000000000,    // 2^31 .. 2^32-1    2*10^9 .. 4*10^9
    2000000000,    // 1,073,741,824 .. 2,147,483,647
    // first correct entry
    1000000000,    //   536,870,912 .. 1,073,741,823

    // ... fill in the rest
    // for testing, skip until 16 leading zeros
    [16] = 100000,   // 32768..65535      // up to 2^16-1, enough for 16-bit inputs
       100000,   // 16384..32767
       10000,   // 8192..16383
       10000,   // 4096..8191
       10000,   // 2048..4095
       10000,   // 1024..2047
       1000,   // 512..1023
       1000,   // 256..511
       1000,   // 128..255
       100,    // 64..127
       100,    // 32..63
       100,    // 16..31     // low end of the range has 2 digits
       10,    // 8..15
       10,    // 4..7
       10,    // 2..3
       10,    // 1
                       // lzcnt32(0) == 32
       10,    // 0     // treat 0 as having one significant digit.
  };
//#else
//#error "INDEX_BY_MSB_POS not set correctly"
//#endif



//#undef HAVE_LZCNT32  // codegen for the other path, for fun

static inline uint32_t msb_power10(uint32_t a)
{
#ifdef HAVE_LZCNT32  // 0-safe lzcnt32 macro available
    #define INDEX_BY_MSB_POS 0
    // a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes
    unsigned lzcnt = lzcnt32(a);  // 32 for a=0
    return catpower_lz32[lzcnt];
#else
  // only generic __builtin_clz available

  static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit");
  // See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html
  // for C++ templates for fixed-width wrappers for __builtin_clz

  #if defined(__i386__) || defined(__x86_64__)
    // x86 where MSB_index = 31-clz = BSR is most efficient
    #define INDEX_BY_MSB_POS 1
    unsigned msb = 31 - __builtin_clz(a|1);  // BSR
    return catpower_msb[msb];
    //return unlikely(a==0) ? 10 : catpower_msb[msb];
  #else
    // use clz directly while still avoiding input=0
    // I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width),
    // but gcc's builtin is still documented as not valid for input=0
    // Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index

    // set the LSB to avoid the a==0 special case
    unsigned clz = __builtin_clz(a|1);
    // table[32] unused, could add yet another #ifdef for that
    #define INDEX_BY_MSB_POS 0
    //return unlikely(a==0) ? 10 : catpower_lz32[clz];
    return catpower_lz32[clz];   // a|1 avoids the special-casing
  #endif  // optimize for BSR or not
#endif // HAVE_LZCNT32
}


uint32_t uintcat (uint32_t ms, uint32_t ls)
{
//  if (ls==0) return ms * 10;  // Another way to avoid the special case for clz

  uint32_t mult = msb_power10(ls); // catpower[clz(ls)];
  uint32_t high = mult * ms;
#if 0
  if (mult <= ls)
      high *= 10;
  return high + ls;
#else
  // hopefully compute both and then select
  // because some CPUs can shift and add at the same time (x86, ARM)
  // so this avoids having an ADD *after* the cmov / csel, if the compiler is smart
  uint32_t another10 = high*10 + ls;
  uint32_t enough = high + ls; 
  return (mult<=ls) ? another10 : enough;
#endif
}

From the Godbolt compiler explorer, this compiles efficiently for x86-64 with and without BSR:

# clang7.0 -O3 for x86-64 SysV,  -march=skylake -mno-lzcnt
uintcat(unsigned int, unsigned int):
    mov     eax, esi
    or      eax, 1
    bsr     eax, eax                    # 31-clz(ls|1)
    mov     ecx, dword ptr [4*rax + catpower_msb]
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]          # retval = high * 10
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high   (drop the *10 result)
    add     eax, esi                    # retval += ls
    ret

Or with lzcnt (enabled by -march=haswell or later, or on some AMD uarches):

uintcat(unsigned int, unsigned int):
          # clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax
    lzcnt   eax, esi                    # C source avoids the |1, saving instructions
    mov     ecx, dword ptr [4*rax + catpower_lz32]
    imul    edi, ecx                    # same as above from here on
    lea     eax, [rdi + rdi]
    lea     eax, [rax + 4*rax]
    cmp     ecx, esi
    cmova   eax, edi
    add     eax, esi
    ret

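Factoring the final add out of both sides of the ternary is a missed optimization, adding 1 cycle of latency after the cmov. On Intel CPUs, we can multiply by 10 and add just as cheaply as we can multiply by 10: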

    ... same start         # hand-optimized version that clang should use
    imul    edi, ecx                    # high = mult * ms
    lea     eax, [rdi + 4*rdi]          # high * 5
    lea     eax, [rsi + rdi*2]          # retval = high * 10 + ls
    add     edi, esi                    # tmp = high + ls
    cmp     ecx, esi
    cmova   eax, edi                    # if(mult>ls) retval = high+ls
    ret

So the high + ls latency would run in parallel with the high*10 + ls latency, and both are inputs to the cmov.

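GCC branches instead of using CMOV for the last condition. GCC also makes a mess of 31-clz(a|1), computing clz with BSR and an XOR with 31, but then subtracting that from 31. It also has some extra mov instructions. Strangely, gcc seems to do better with the BSR code even when lzcnt is available.

clang has no trouble optimizing away the 31-clz double inversion and just using BSR directly.

For PowerPC64, clang also produces branchless asm. gcc does something similar, but with a branch like on x86-64.

uintcat:
.Lfunc_gep0:
    addis 2, 12, .TOC.-.Lfunc_gep0@ha
    addi 2, 2, .TOC.-.Lfunc_gep0@l
    ori 6, 4, 1                         # OR immediate
    addis 5, 2, catpower_lz32@toc@ha
    cntlzw 6, 6                         # CLZ word
    addi 5, 5, catpower_lz32@toc@l      # static table address
    rldic 6, 6, 2, 30                   # rotate left and clear immediate (shift and zero-extend the CLZ result)
    lwzx 5, 5, 6                        # Load Word Zero eXtend, catpower_lz32[clz]
    mullw 3, 5, 3                       # mul word
    cmplw 5, 4                          # compare mult, ls
    mulli 6, 3, 10                      # mul immediate
    isel 3, 3, 6, 1                     # conditional select high vs. high*10
    add 3, 3, 4                         # + ls
    clrldi 3, 3, 32                     # zero extend, clearing upper 32 bits
    blr                                 # return

Compressing the table

Using clz(ls|1) >> 1 or +1 should work, because 4 < 10. The table always needs at least 3 entries before it gains another digit. I haven't looked into this. (And I have already spent longer on this than I meant to. :P)

Or right-shift by more to just get a starting point for a loop, e.g. mult = clz(ls) >= 18 ? 100000 : 10;, or a 3- or 4-way chain of if.

Or loop on mult *= 100, and after exiting that loop, sort out whether you want old_mult * 10 or mult (i.e. check whether you went too far). That halves the iteration count for even numbers of digits.

(Watch out for a possible infinite loop on a large ls whose result would overflow. If mult *= 100 wraps to 0, it will always stay <= ls, for example for ls = 1000000000.)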
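One way the step-by-100 loop could look is sketched below. The function name `uintcat_by100` is hypothetical, and using a 64-bit `mult` is an assumption made here to sidestep the wraparound hazard: with 32-bit inputs, `mult` never grows past 10^10.

```c
#include <stdint.h>

static uint64_t uintcat_by100(uint32_t ms, uint32_t ls)
{
  uint64_t old_mult = 1;
  uint64_t mult = 1;

  /* Step two decimal digits at a time. */
  while (mult <= ls)
  {
    old_mult = mult;
    mult *= 100;
  }

  /* mult is now greater than ls, but it may be one power of 10 too far.
     If old_mult * 10 is already greater than ls, use that instead. */
  if (old_mult * 10 > ls)
    mult = old_mult * 10;

  /* As with any 64-bit result, two 10-digit inputs can still overflow. */
  return (uint64_t)ms * mult + ls;
}
```

For ls = 12 the loop stops at mult = 100 and the check keeps it; for ls = 100 the loop overshoots to 10000 and the check pulls it back to 1000.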
