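I am trying to generate floating-point code for an ARM Cortex-A9. I am investigating the performance difference between code generated for the NEON coprocessor and code generated for the VFPv3 coprocessor only. I started with the following simple test program: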
#define ASIZE 4

float A[ASIZE] = {7.0f, 2.0f, 3.0f, 4.0f};
float B[ASIZE] = {5.0f, 6.0f, 7.0f, 8.0f};
float C[ASIZE];

int main(void) {
    unsigned int i;

    for (i = 0; i < ASIZE; i++)
    {
        C[i] = A[i] + B[i];
    }

    return 0;
}
When I compile it with the following flags:
CCFLAGS = -g -c -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp -ffast-math -funsafe-math-optimizations
I get the following assembly output from either the GCC or the Code Sourcery Lite compiler:
9:atest.c **** int main(void) {
23 .loc 1 9 0
24 .cfi_startproc
25 @ args = 0, pretend = 0, frame = 0
26 @ frame_needed = 0, uses_anonymous_args = 0
27 @ link register save eliminated.
10:atest.c ****
11:atest.c **** unsigned int i;
12:atest.c ****
13:atest.c **** for (i=0; i<ASIZE; i++)
14:atest.c **** {
15:atest.c **** C[i] = A[i] + B[i];
28 .loc 1 15 0
29 0000 003000E3 movw r3, #:lower16:.LANCHOR0
30 0004 002000E3 movw r2, #:lower16:C
31 0008 003040E3 movt r3, #:upper16:.LANCHOR0
32 000c DF2A63F4 vld1.64 {d18-d19}, [r3:64]
33 0010 040BD3ED vldr d16, [r3, #16]
34 0014 061BD3ED vldr d17, [r3, #24]
35 0018 E00D42F2 vadd.f32 q8, q9, q8
36 001c 002040E3 movt r2, #:upper16:C
16:atest.c **** }
17:atest.c ****
18:atest.c **** return 0;
19:atest.c **** }
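This is what I expected to see: the floating-point instructions have the form "Vxxx".
Now when I change the compiler flags to -mfpu=vfpv3 (or any other permutation, such as -mfpu=vfpv3-d16-fp16), I see the following: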
9:atest.c **** int main(void) {
23 .loc 1 9 0
24 .cfi_startproc
25 @ args = 0, pretend = 0, frame = 0
26 @ frame_needed = 0, uses_anonymous_args = 0
27 @ link register save eliminated.
28 .LVL0:
11:atest.c **** unsigned int i;
13:atest.c **** for (i=0; i<ASIZE; i++)
14:atest.c **** {
15:atest.c **** C[i] = A[i] + B[i];
29 .loc 1 15 0
30 0000 003000E3 movw r3, #:lower16:.LANCHOR0
31 0004 002000E3 movw r2, #:lower16:C
32 0008 003040E3 movt r3, #:upper16:.LANCHOR0
33 000c 002040E3 movt r2, #:upper16:C
34 0010 004A93ED flds s8, [r3]
16:atest.c **** }
18:atest.c **** return 0;
19:atest.c **** }
35 .loc 1 19 0
36 0014 0000A0E3 mov r0, #0
15:atest.c **** }
37 .loc 1 15 0
38 0018 046A93ED flds s12, [r3, #16]
39 001c 014AD3ED flds s9, [r3, #4]
40 0020 056AD3ED flds s13, [r3, #20]
41 0024 025A93ED flds s10, [r3, #8]
42 0028 067A93ED flds s14, [r3, #24]
43 002c 035AD3ED flds s11, [r3, #12]
44 0030 077AD3ED flds s15, [r3, #28]
45 0034 066A34EE fadds s12, s8, s12
46 0038 A66A74EE fadds s13, s9, s13
47 003c 077A35EE fadds s14, s10, s14
48 0040 A77A75EE fadds s15, s11, s15
49 0044 006A82ED fsts s12, [r2]
50 .LVL1:
51 0048 016AC2ED fsts s13, [r2, #4]
52 .LVL2:
53 004c 027A82ED fsts s14, [r2, #8]
54 .LVL3:
55 0050 037AC2ED fsts s15, [r2, #12]
56 .LVL4:
57 .loc 1 19 0
58 0054 1EFF2FE1 bx lr
59 .cfi_endproc
60 .LFE0:
61 .fnend
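All of the floating-point assembly instructions have the form "Fxxx". Why are they not of the form "Vxxx"? I expected the load instructions to look like VLD1.32 and the add instructions to look like VADD.F32. When I look up the "flds" instruction in the official ARM documentation, it says that "flds" is for the ARM9 architecture, not the Cortex-A9.
I have tried every combination of the -mcpu, -mfpu and -march compiler flags, but I cannot get either the GCC compiler for Linux or the Code Sourcery Lite compiler for Linux to generate floating-point assembly instructions in the "Vxxx" form. What am I doing wrong?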
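Answer 0 (score: 1)
What am I doing wrong?
Absolutely nothing, unless you count using an old disassembler. The instructions are the same and the encodings are the same; only the recommended assembly mnemonics changed. Evidently, whichever disassembler you are using (I don't recognise the output format) has not been updated since ARM introduced the UAL syntax, so it still disassembles to the old mnemonics. Feel free to try another disassembler (a recent objdump, for example) for comparison, but as I said, it is only a difference in representation, so there is nothing to worry about.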
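To make the "same encoding, different spelling" point concrete, here is a minimal sketch in C that decodes one single-precision VFP load word and prints it under both mnemonics. The names `vfp_load`, `decode_vfp_load_single` and `format_both` are hypothetical helpers made up for this illustration, not part of any toolchain, and the bit layout in the comment is my reading of the A32 encoding. It also assumes the byte column in your listing is in little-endian memory order, so `004A93ED` is the word 0xED934A00.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields of an A32 single-precision VFP load: pre-UAL "flds", UAL "vldr".
   (Hypothetical helper type, for illustration only.) */
struct vfp_load {
    unsigned sd;     /* destination register number (s0..s31) */
    unsigned rn;     /* base core register number (r0..r15) */
    int offset;      /* signed byte offset from the base register */
};

/*
 * Decode one 32-bit A32 instruction word.
 * Assumed bit layout: cond 1101 U D 0 1 Rn Vd 1010 imm8
 * Returns 1 and fills *out on a match, 0 otherwise.
 */
int decode_vfp_load_single(uint32_t word, struct vfp_load *out)
{
    if ((word >> 28) == 0xFu)                 /* cond 1111 is a different encoding space */
        return 0;
    if ((word & 0x0F300F00u) != 0x0D100A00u)  /* fixed bits do not match */
        return 0;

    unsigned d    = (word >> 22) & 1u;        /* low bit of the register number */
    unsigned vd   = (word >> 12) & 0xFu;      /* high four bits of the register number */
    unsigned imm8 = word & 0xFFu;             /* offset, counted in 4-byte words */
    int add       = (int)((word >> 23) & 1u); /* U bit: 1 = add offset, 0 = subtract */

    out->sd = (vd << 1) | d;
    out->rn = (word >> 16) & 0xFu;
    out->offset = add ? (int)(imm8 * 4u) : -(int)(imm8 * 4u);
    return 1;
}

/* Write the same decoded instruction in both spellings; only the mnemonic differs. */
void format_both(const struct vfp_load *ld, char *pre_ual, char *ual, size_t n)
{
    snprintf(pre_ual, n, "flds s%u, [r%u, #%d]", ld->sd, ld->rn, ld->offset);
    snprintf(ual, n, "vldr s%u, [r%u, #%d]", ld->sd, ld->rn, ld->offset);
}
```

Calling `decode_vfp_load_single(0xED934A00u, &ld)` fills in sd = 8, rn = 3 and offset = 0, which is your `flds s8, [r3]`; a UAL-aware disassembler would print the very same word as `vldr s8, [r3]`. The other instructions in your listing pair up the same way: `fsts` corresponds to `vstr` and `fadds` to `vadd.f32`.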