ARM NEON matrix multiplication example

Time: 2017-02-14 09:57:56

Tags: assembly arm matrix-multiplication simd neon

I want to learn NEON. I took the example from the ARM website:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0425/ch04s06s05.html

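I thought I would just get it running and then start experimenting with it. Keep it simple (KISS). The program builds fine (GCC), but when it reaches the first VST instruction I get a 'Segmentation fault'. If I remove the VST instructions, the program runs. In GDB everything looks fine (register values, etc.); the error only occurs during the store to memory.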

Thanks for any guidance or help...

.global main
.func main
main:

.macro mul_col_f32 res_q, col0_d, col1_d
vmul.f32    \res_q, q8,  \col0_d[0] @ multiply col element 0 by matrix col 0
vmla.f32    \res_q, q9,  \col0_d[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32    \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32    \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
.endm

LDR R0, =result0a
LDR R1, =result1a
LDR R2, =result2a

vld1.32  {d16-d19}, [r1]!   @ load the first eight elements of matrix 0
vld1.32  {d20-d23}, [r1]!   @ load the second eight elements of matrix 0
vld1.32  {d0-d3}, [r2]!         @ load the first eight elements of matrix 1
vld1.32  {d4-d7}, [r2]!         @ load the second eight elements of matrix 1

mul_col_f32 q12, d0, d1     @ matrix 0 * matrix 1 col 0
mul_col_f32 q13, d2, d3     @ matrix 0 * matrix 1 col 1
mul_col_f32 q14, d4, d5     @ matrix 0 * matrix 1 col 2
mul_col_f32 q15, d6, d7     @ matrix 0 * matrix 1 col 3

vst1.32  {d24-d27}, [r0]!   @ store first eight elements of result.
vst1.32  {d28-d31}, [r0]!   @ store second eight elements of result.

MOV R7, #1
SWI 0

result1a:   .word 0xFFFFFFFF    @ d16
result1b:   .word 0xEEEEEEEE    @ d16
result1c:   .word 0xDDDDDDDD    @ d17
result1d:   .word 0xCCCCCCCC    @ d17
result1e:   .word 0xBBBBBBBB    @ d18
result1f:   .word 0xAAAAAAAA    @ d18
result1g:   .word 0x99999999    @ d19
result1h:   .word 0x88888888    @ d19

result2a:   .word 0x77777777    @ d0
result2b:   .word 0x66666666    @ d0
result2c:   .word 0x55555555    @ d1
result2d:   .word 0x44444444    @ d1
result2e:   .word 0x33333333    @ d2
result2f:   .word 0x22222222    @ d2
result2g:   .word 0x11111111    @ d3
result2h:   .word 0x0F0F0F0F    @ d3

result0a:   .word 0x0       @ R0
result0b:   .word 0x0       @ R0
result0c:   .word 0x0       @ R0
result0d:   .word 0x0       @ R0
result0e:   .word 0x0       @ R0
result0f:   .word 0x0       @ R0
result0g:   .word 0x0       @ R0
result0h:   .word 0x0       @ R0

1 answer:

Answer 0 (score: 1)

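You are trying to write 8 * 8 = 64 bytes (two VST1 stores of four D registers each), but you have only reserved 4 * 8 = 32 bytes (eight .word entries at result0a).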
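To make the arithmetic concrete, here is a small C sketch of the byte counts involved. The constant and function names are made up for illustration; the numbers are read off the listing in the question (two vst1.32 stores of four D registers each, and eight .word entries starting at result0a).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative names only; the values come from the question's listing. */
enum {
    D_REG_BYTES    = 8,  /* a NEON D register is 64 bits wide */
    D_REGS_STORED  = 8,  /* d24-d27 plus d28-d31 */
    WORDS_RESERVED = 8   /* result0a .. result0h, one 32-bit .word each */
};

/* Bytes the two vst1.32 instructions write, starting at result0a. */
size_t bytes_written(void)
{
    return (size_t)D_REGS_STORED * D_REG_BYTES;
}

/* Bytes actually reserved by the eight .word directives. */
size_t bytes_reserved(void)
{
    return WORDS_RESERVED * sizeof(uint32_t);
}

/* How far the stores run past the end of the reserved area. */
size_t bytes_overrun(void)
{
    assert(bytes_written() >= bytes_reserved()); /* guard unsigned underflow */
    return bytes_written() - bytes_reserved();
}
```

The same count applies to the loads: each input matrix is read with two vld1.32 instructions of four D registers each (64 bytes), while only eight words (32 bytes) are defined per input, so a full 4x4 float matrix needs sixteen words per label group.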

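Also, you are most likely trying to write to a read-only region: the listing has no section directive, so the data ends up in .text along with the code.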

Why not call your function from C/C++, passing addresses that you allocated with malloc?

Alternatively, you can simply use the stack.
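Both suggestions above can be sketched in C. This is a minimal sketch, not the asker's code: matmul4x4_f32, fill_inputs, multiply_on_heap and multiply_on_stack are hypothetical names, and the multiply routine is a C stand-in for the assembly (NEON intrinsics when the compiler defines __ARM_NEON, a plain scalar loop otherwise). The test data (identity matrix times the values 1..16) is also an assumption, chosen so the expected result is obvious.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

enum { MAT_ELEMS = 16 }; /* 4x4 floats = 64 bytes per matrix */

/* Hypothetical stand-in for the assembly routine: result = m0 * m1,
   all three matrices 4x4, single precision, column-major. */
static void matmul4x4_f32(float *result, const float *m0, const float *m1)
{
    assert(result && m0 && m1);
#if defined(__ARM_NEON)
    /* The four columns of matrix 0 (q8-q11 in the question's listing). */
    const float32x4_t a0 = vld1q_f32(m0);
    const float32x4_t a1 = vld1q_f32(m0 + 4);
    const float32x4_t a2 = vld1q_f32(m0 + 8);
    const float32x4_t a3 = vld1q_f32(m0 + 12);
    for (int col = 0; col < 4; ++col) {
        const float32x4_t b = vld1q_f32(m1 + 4 * col);
        /* Same pattern as the mul_col_f32 macro: one vmul, three vmla. */
        float32x4_t r = vmulq_lane_f32(a0, vget_low_f32(b), 0);
        r = vmlaq_lane_f32(r, a1, vget_low_f32(b), 1);
        r = vmlaq_lane_f32(r, a2, vget_high_f32(b), 0);
        r = vmlaq_lane_f32(r, a3, vget_high_f32(b), 1);
        vst1q_f32(result + 4 * col, r);
    }
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets. */
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += m0[4 * k + row] * m1[4 * col + k];
            result[4 * col + row] = acc;
        }
    }
#endif
}

/* Assumed test data: m0 = identity, m1 = 1..16. */
static void fill_inputs(float *m0, float *m1)
{
    for (int i = 0; i < MAT_ELEMS; ++i) {
        m0[i] = (i % 5 == 0) ? 1.0f : 0.0f; /* diagonal: 0, 5, 10, 15 */
        m1[i] = (float)(i + 1);
    }
}

/* Heap variant: all three matrices live in writable malloc'd memory.
   Returns 0 on success, -1 if the allocation fails. */
int multiply_on_heap(float out[MAT_ELEMS])
{
    float *buf = malloc(3 * MAT_ELEMS * sizeof *buf);
    if (buf == NULL)
        return -1;
    float *m0 = buf, *m1 = buf + MAT_ELEMS, *res = buf + 2 * MAT_ELEMS;
    fill_inputs(m0, m1);
    matmul4x4_f32(res, m0, m1);
    memcpy(out, res, MAT_ELEMS * sizeof *res);
    free(buf);
    return 0;
}

/* Stack variant: local arrays, each with room for all 16 floats. */
void multiply_on_stack(float out[MAT_ELEMS])
{
    float m0[MAT_ELEMS], m1[MAT_ELEMS], res[MAT_ELEMS];
    fill_inputs(m0, m1);
    matmul4x4_f32(res, m0, m1);
    memcpy(out, res, sizeof res);
}
```

To use the real assembly instead, the body of matmul4x4_f32 could be replaced by a declaration of an external routine. Assuming that routine follows the ARM procedure call standard, the three pointers would arrive in r0, r1 and r2, which matches the register roles in the question (r0 for the result, r1 for matrix 0, r2 for matrix 1), so the three LDR instructions would no longer be needed.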