Question

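I have this program that I am working on for school. Its purpose is to add two matrices and store the result in a third matrix. Currently, when run with the driver (a .o file), the instruction count is 1,003,034,420, but it needs to be under 1 billion. However, I have no idea how to get there, because I have looked at every instruction I use and all of them seem necessary for the program to work.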

Note that I am not able to use loop unrolling to reduce the instruction count at this time.

Here is the program:

/* This function has 5 parameters, and the declaration in the
   C-language would look like:

   void matadd (int **C, int **A, int **B, int height, int width)

   C, A, B, and height will be passed in r0-r3, respectively, and
   width will be passed on the stack. */

.arch armv7-a
.text
.align  2
.global matadd
.syntax unified
.arm
matadd:
   push  {r4, r5, r6, r7, r8, r9, r10, r11, lr}
   ldr   r4, [sp, #36]                 @ load width into r4
   mov   r5, #0                        @ r5 is current row index
row_loop: 
   mov   r6, #0                        @ r6 is the col, reset it for each new row
   cmp   r5, r3                        @ compare row with height
   beq   end_loops                     @ we have finished all of the rows
   ldr   r11, [r0, r5, lsl #2]         @ r11 is the current row array of C
   ldr   r7, [r1, r5, lsl #2]          @ r7 is the current row array of A
   ldr   r8, [r2, r5, lsl #2]          @ r8 is the current row array of B
                                       @ the left shifts are so that we skip
                                       @ 4 bytes since these are ints
                                       @ these do not change registers
col_loop:   
   cmp   r6, r4                        @ compare col with width
   beq   end_col                       @ we have finished every col in this row
   ldr   r9, [r7, r6, lsl #2]          @ r9 is cur_row[col] of A
   ldr   r10, [r8, r6, lsl #2]         @ r10 is cur_row[col] of B
   add   r9, r9, r10                   @ r9 is A[row][col] + B[row][col]
   str   r9, [r11, r6, lsl #2]         @ store result of addition in C[row][col]
   add   r6, r6, #1                    @ increment col
   b     col_loop                      @ get next entry
end_col:
   add   r5, r5, #1                    @ increment row
   b     row_loop                      @ get next row
end_loops:   
   pop   {r4, r5, r6, r7, r8, r9, r10, r11, pc}

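I think there must be some instruction that combines cmp and b or something, but I can't seem to find it. Any pointers on how to reduce the instruction count?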

Answer 1

You want to remove the unconditional branch from the inner loop. Right now your loops have this shape:

loop_start:
    cmp x, y
    beq loop_exit

    blah blah blah

    b loop_start
loop_exit:

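Notice that you execute an unconditional branch (b loop_start) every time through the loop. Avoid that branch by inlining the branch target, up to and including the next conditional branch.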

loop_start:
    cmp x, y
    beq loop_exit

loop_middle:
    blah blah blah

    ; was "b loop_start" but we just copy the instructions
    ; starting at "loop_start" up to the conditional branch

    cmp x, y
    beq loop_exit

    ; and then jump to the instruction after the inlined portion
    b loop_middle
loop_exit:

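At this point the second beq is just a branch over a branch, so the pair can be replaced with a single branch on the inverted condition.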

loop_start:
    cmp x, y
    beq loop_exit

loop_middle:
    blah blah blah

    cmp x, y

    ; "beq loop_exit" followed by "b loop_middle" is equivalent to this
    bne loop_middle

loop_exit:

There are two places in your code where you can apply this optimization: the column loop and the row loop.
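
As a minimal sketch, here is how the transformation might look when applied to the column loop from the question. The label col_body is a hypothetical name introduced for illustration; the registers and the other labels are the ones used in the question. Each pass through the loop drops from eight instructions to seven, and the row loop can be rotated in the same way.

col_loop:
   cmp   r6, r4                        @ entry test, runs once per row
   beq   end_col                       @ skip the row entirely if width is 0
col_body:                              @ hypothetical label, not in the original
   ldr   r9, [r7, r6, lsl #2]          @ r9 is cur_row[col] of A
   ldr   r10, [r8, r6, lsl #2]         @ r10 is cur_row[col] of B
   add   r9, r9, r10                   @ r9 is A[row][col] + B[row][col]
   str   r9, [r11, r6, lsl #2]         @ store the sum in C[row][col]
   add   r6, r6, #1                    @ increment col
   cmp   r6, r4                        @ inlined copy of the loop test
   bne   col_body                      @ replaces "beq end_col" + "b col_loop"
end_col:

Whether this alone brings the total under the limit depends on the matrix sizes the driver uses, which the question does not state.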

(Don't forget to cite this web page when you submit your solution, to avoid accusations of academic dishonesty.)
