Question

我正在写一个Langton的蚂蚁sim（用于规则字符串RLR），我正在尝试优化它以提高速度。这是相关的代码：

#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
  uint_fast8_t* state;
  uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
  uint_fast8_t ant_orientation=0;
  uint_fast8_t two_pow_five=32;
  uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
  uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
  int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1};
  ... initialise empty state
  while(1)
  {
    while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
    {
      while(--two_pow_thirty_two)
      {
        /*one iteration*/
        /* 54 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
        state[ant] = (36 >> (state[ant] *2) ) & 3;
        ant+=move_ant[ant_orientation];
        */

        /* 47 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        state[ant] += (state[ant]==2)?-2:1;
        ant+=move_ant[ant_orientation];
        */

        /* 46 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        if(state[ant]==2)
        {
          --state[ant];
          --state[ant];
        }
        else
          ++state[ant];
        ant+=move_ant[ant_orientation];
        */

        /* 44 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
        if(state[ant]==3)state[ant]=0;
        ant+=move_ant[ant_orientation];
        */

        // 37 seconds for init + 2^32 steps
        // handle every situation with nested switches and constants
        switch(ant_orientation)
        {
          case 0:
            switch(state[ant])
            {
              case 0:
                ant_orientation=1;
                state[ant]=1;
                ++ant;
                break;
              case 1:
                ant_orientation=3;
                state[ant]=2;
                --ant;
                break;
              case 2:
                ant_orientation=1;
                state[ant]=0;
                ++ant;
                break;
            }
            break;
          case 1:
            switch(state[ant])
            {
              ...
            }
            break;
          case 2:
            switch(state[ant])
            {
              ...
            }
            break;
          case 3:
            switch(state[ant])
            {
              ...
            }
            break;
        }

      }
    }
    two_pow_five=32;
    ... dump to file every 2^37 steps
  }
  return 0;
}

我有两个问题：

我尝试通过试错测试尽可能优化c，是否有任何我没有利用的技巧？请尝试用c而不是汇编说话，虽然我可能会在某些时候尝试装配。
有没有更好的方法来模拟问题以提高速度？

更多信息：便携性无关紧要。我使用的是64位Linux，使用gcc，i5-2500k和16 GB内存。现状的状态阵列使用4GiB，该程序可以使用12GiB的ram。的sizeof（uint_fast8_t）= 1。故意不存在边界检查，很容易从转储中手动发现损坏。

编辑或许反直觉地说，在交换语句上堆积而不是消除它们已经产生了迄今为止最好的效率。

编辑：我已经重新建模了问题，并且每次迭代的速度比单步更快。之前，每个状态元素使用两位并描述Langton蚂蚁网格中的单个单元格。新方法使用全部8位，并描述网格的2x2部分。通过查找当前状态+方向的步数，新方向和新状态的预先计算值，每次迭代完成可变数量的步骤。假设一切都是同样可能的，平均每次迭代采取2步。作为奖励，它使用1/4的内存来模拟相同的区域：

while(--iteration)
{
        // roughly 31 seconds per 2^32 steps
        table_offset=(state[ant]*24)+(ant_orientation*3);
        it+=twoxtwo_table[table_offset+0];
        state[ant]=twoxtwo_table[table_offset+2];
        ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}

尚未尝试优化它，接下来要尝试的是消除偏移方程并使用嵌套开关和常量查找（如前所述（但有648个内部案例而不是12个）。

Answer 1

或者，您可以使用单个无符号字节常量作为人工寄存器而不是分支：

value:   1  3  1  1
bits:   01 11 01 01 ---->101 decimal value for an unsigned byte
index    3  2  1  0  ---> get first 2 bits to get  "1"  (no shift)
                      --> get second 2 bits to get "1"  (shifting for 2 times)
                      --> get third 2 bits to get  "3"  (shifting for 4 times)
                      --> get last 2 bits to get   "1"  (shifting for 6 times)

 Then "AND" the result with  binary(11) or decimal(3) to get your value.

 (101>>( (++state[ant])*2  ) ) & 3 would give you the turnright or turnleft

 Example:
 ++state[ant]= 0:  ( 101>>( (0)*2 ) )&3  --> 101 & 3 = 1
 ++state[ant]= 1:  ( 101>>( (1)*2 ) )&3  --> 101>>2 & 3 = 1
 ++state[ant]= 2:  ( 101>>( (2)*2 ) )&3  --> 101>>4 & 3 = 3  -->turn left
 ++state[ant]= 3:  ( 101>>( (3)*2 ) )&3  --> 101>>6 & 3 = 1

 Maximum six-shifting + one-multiplication + one-"and" may be better.
 Dont forget constant can be auto-promoted so you may add some suffixes or something else.

由于您对％4模数使用“unsigned int”，因此可以使用“和”操作。

  state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;

对于不熟练的编译器，这应该会提高速度。

最难的部分：modulo-3

  C = A % B is equivalent to C = A – B * (A / B)
  We need  state[ant]%3
  Result = state[ant] - 3 * (state[ant]/3)

  state[ant]/3 is always <=1 for your valid direction states.
  Only when state[ant]  is 3 then state[ant]/3 is 1, other values give 0.
  When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
  Result = state[ant] - (0 or 3)

  Lets look at all possibilities:

  state[ant]=0:  0 - 0  ---> 0   ----> 00100100 shifted by 0 times &3 --> 00000000
  state[ant]=1:  1 - 0  ---> 1   ----> 00100100 shifted by 2 times &3 --> 00000001
  state[ant]=2:  2 - 0  ---> 2   ----> 00100100 shifted by 4 times &3 --> 00000010
  state[ant]=3:  3 - 3  ---> 0   ----> 00100100 shifted by 6 times &3 --> 00000000

  00100100 is 36 in decimal.

  (36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)

  Example: 

  state[ant]=0: 36 >> 0  --> 36 ----> 36& 3 ----> 0  satisfies 0%3
  state[ant]=1: 36 >> 2  --> 9 -----> 9 & 3 ----> 1  satisfies 1%3
  state[ant]=2: 36 >> 4  --> 2 -----> 2 & 3 ----> 2 satisfies  2%3
  state[ant]=3: 36 >> 6  --> 0 -----> 0 & 3 ----> 0 satisfies 3%3

如何优化这个Langton的蚂蚁sim？

1 个答案: