我正在写一个Langton的蚂蚁sim(用于规则字符串RLR),我正在尝试优化它以提高速度。这是相关的代码:
#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
uint_fast8_t* state;
uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
uint_fast8_t ant_orientation=0;
uint_fast8_t two_pow_five=32;
uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1};
... initialise empty state
while(1)
{
while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
{
while(--two_pow_thirty_two)
{
/*one iteration*/
/* 54 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
state[ant] = (36 >> (state[ant] *2) ) & 3;
ant+=move_ant[ant_orientation];
*/
/* 47 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
state[ant] += (state[ant]==2)?-2:1;
ant+=move_ant[ant_orientation];
*/
/* 46 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
if(state[ant]==2)
{
--state[ant];
--state[ant];
}
else
++state[ant];
ant+=move_ant[ant_orientation];
*/
/* 44 seconds for init + 2^32 steps
ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
if(state[ant]==3)state[ant]=0;
ant+=move_ant[ant_orientation];
*/
// 37 seconds for init + 2^32 steps
// handle every situation with nested switches and constants
switch(ant_orientation)
{
case 0:
switch(state[ant])
{
case 0:
ant_orientation=1;
state[ant]=1;
++ant;
break;
case 1:
ant_orientation=3;
state[ant]=2;
--ant;
break;
case 2:
ant_orientation=1;
state[ant]=0;
++ant;
break;
}
break;
case 1:
switch(state[ant])
{
...
}
break;
case 2:
switch(state[ant])
{
...
}
break;
case 3:
switch(state[ant])
{
...
}
break;
}
}
}
two_pow_five=32;
... dump to file every 2^37 steps
}
return 0;
}
我有两个问题:
我尝试通过试错测试尽可能优化c,是否有任何我没有利用的技巧?请尝试用c而不是汇编说话,虽然我可能会在某些时候尝试装配。
有没有更好的方法来模拟问题以提高速度?
更多信息:便携性无关紧要。我使用的是64位Linux,使用gcc,i5-2500k和16 GB内存。现状的状态阵列使用4GiB,该程序可以使用12GiB的ram。的sizeof(uint_fast8_t)= 1。故意不存在边界检查,很容易从转储中手动发现损坏。
编辑或许反直觉地说,在交换语句上堆积而不是消除它们已经产生了迄今为止最好的效率。
编辑:我已经重新建模了问题,并且每次迭代的速度比单步更快。之前,每个状态元素使用两位并描述Langton蚂蚁网格中的单个单元格。新方法使用全部8位,并描述网格的2x2部分。通过查找当前状态+方向的步数,新方向和新状态的预先计算值,每次迭代完成可变数量的步骤。假设一切都是同样可能的,平均每次迭代采取2步。作为奖励,它使用1/4的内存来模拟相同的区域:
while(--iteration)
{
// roughly 31 seconds per 2^32 steps
table_offset=(state[ant]*24)+(ant_orientation*3);
it+=twoxtwo_table[table_offset+0];
state[ant]=twoxtwo_table[table_offset+2];
ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}
尚未尝试优化它,接下来要尝试的是消除偏移方程并使用嵌套开关和常量查找(如前所述(但有648个内部案例而不是12个)。
答案 0 :(得分:3)
或者,您可以使用单个无符号字节常量作为人工寄存器而不是分支:
value: 1 3 1 1
bits: 01 11 01 01 ---->101 decimal value for an unsigned byte
index 3 2 1 0 ---> get first 2 bits to get "1" (no shift)
--> get second 2 bits to get "1" (shifting for 2 times)
--> get third 2 bits to get "3" (shifting for 4 times)
--> get last 2 bits to get "1" (shifting for 6 times)
Then "AND" the result with binary(11) or decimal(3) to get your value.
(101>>( (++state[ant])*2 ) ) & 3 would give you the turnright or turnleft
Example:
++state[ant]= 0: ( 101>>( (0)*2 ) )&3 --> 101 & 3 = 1
++state[ant]= 1: ( 101>>( (1)*2 ) )&3 --> 101>>2 & 3 = 1
++state[ant]= 2: ( 101>>( (2)*2 ) )&3 --> 101>>4 & 3 = 3 -->turn left
++state[ant]= 3: ( 101>>( (3)*2 ) )&3 --> 101>>6 & 3 = 1
Maximum six-shifting + one-multiplication + one-"and" may be better.
Dont forget constant can be auto-promoted so you may add some suffixes or something else.
由于您对%4模数使用“unsigned int”,因此可以使用“和”操作。
state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;
对于不熟练的编译器,这应该会提高速度。
最难的部分:modulo-3
C = A % B is equivalent to C = A – B * (A / B)
We need state[ant]%3
Result = state[ant] - 3 * (state[ant]/3)
state[ant]/3 is always <=1 for your valid direction states.
Only when state[ant] is 3 then state[ant]/3 is 1, other values give 0.
When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
Result = state[ant] - (0 or 3)
Lets look at all possibilities:
state[ant]=0: 0 - 0 ---> 0 ----> 00100100 shifted by 0 times &3 --> 00000000
state[ant]=1: 1 - 0 ---> 1 ----> 00100100 shifted by 2 times &3 --> 00000001
state[ant]=2: 2 - 0 ---> 2 ----> 00100100 shifted by 4 times &3 --> 00000010
state[ant]=3: 3 - 3 ---> 0 ----> 00100100 shifted by 6 times &3 --> 00000000
00100100 is 36 in decimal.
(36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)
Example:
state[ant]=0: 36 >> 0 --> 36 ----> 36& 3 ----> 0 satisfies 0%3
state[ant]=1: 36 >> 2 --> 9 -----> 9 & 3 ----> 1 satisfies 1%3
state[ant]=2: 36 >> 4 --> 2 -----> 2 & 3 ----> 2 satisfies 2%3
state[ant]=3: 36 >> 6 --> 0 -----> 0 & 3 ----> 0 satisfies 3%3