Question

I am trying to get familiar with NEON instructions, both assembly and intrinsics. I am using gcc 4.8.2 with hardfp, and I want to use the NEON memcpy with preload according to:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

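I also found this topic: ARM memcpy and alignment, but it differs slightly from the implementation on the official ARM page.

Unfortunately, I have never used a .s file together with a .c file, so I need some help. My .c file looks like this: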

       #include <stdlib.h>
       #include <stdio.h>
       #include <string.h>
       #include <math.h>
       #include <time.h>
       #include <stdint.h>
       #include <arm_neon.h>

       /* Implemented in assembly.s */
       void NEONCopyPLD(void *dst, const void *src, size_t size);

       int main()
       {

           clock_t start, end;           // timer variables
           uint32_t i,X=100;

           size_t size = 2048*32/* arbitrary */;
           size_t offset = 1;
           char* src = malloc(sizeof(char)*(size + offset));
           char* dst = malloc(sizeof(char)*(size));

           NEONCopyPLD( dst, src + offset, size );
           memcpy( dst, src + offset, size );
           return(0);
       }

And the assembly.s file looks like this:

       .global NEONCopyPLD
       NEONCopyPLD:
             PLD [r1, #0xC0]
             VLDM r1!,{d0-d7}
             VSTM r0!,{d0-d7}
             SUBS r2,r2,#0x40
             BGE NEONCopyPLD

I compile the program with the following command:

arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -Ofast -fprefetch-loop-arrays assembly.s asm_pr.c -o output

I get the following error:

 potentially unexpected fatal signal 11.

 CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
 task: bf907c00 ti: bef4a000 task.ti: bef4a000
 PC is at 0x4c90ce LR is at 0x852d
 pc : [<004c90ce>]    lr : [<0000852d>]    psr: 40030030
 sp : 7e958cb0  ip : 00000107  fp : 00000000
 r10: 76f91000  r9 : 00000000  r8 : 00000000
 r7 : 00001017  r6 : 00e85010  r5 : 00e75009  r4 : 00010001
 r3 : 000f4240  r2 : 00010000  r1 : 00e75009  r0 : 00e85010
 Flags: nZcv  IRQs on  FIQs on  Mode USER_32  ISA Thumb  Segment user
 Control: 10c5387d  Table: 4ef7404a  DAC: 00000015
 CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
 Backtrace:
 [<800120a4>] (dump_backtrace+0x0/0x118) from [<80012318>] (show_stack+0x20/0x24)
 [<800122f8>] (show_stack+0x0/0x24) from [<804fab0c>] (dump_stack+0x24/0x28)
 [<804faae8>] (dump_stack+0x0/0x28) from [<8000f560>] (show_regs+0x30/0x34)
 [<8000f530>] (show_regs+0x0/0x34) from [<8003349c>](get_signal_to_deliver+0x318/0x668)   
 [<80033184>] (get_signal_to_deliver+0x0/0x668) from [<80011664>] (do_signal+0x11c/0x450)
 [<80011548>] (do_signal+0x0/0x450) from [<80011b20>] (do_work_pending+0x74/0xac)
 [<80011aac>] (do_work_pending+0x0/0xac) from [<8000e500>] (work_pending+0xc/0x20)
 Segmentation fault

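Another question I have: can SIMD instructions (intrinsics or auto-vectorization) be used to speed up initializing an array to 0? I noticed that the following code cannot be auto-vectorized: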

   for (i=0;i<N;i++)
        *(a++)=0;

However, this code can be auto-vectorized:

   for (i=0;i<N;i++)
       a[i]=i;

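My ultimate goal is to investigate whether NEON can be used to write something faster than memset().

Finally, I would like to ask about non-vectorizable loops. According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html#unvectoriz the following code cannot be auto-vectorized: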

           while (*p != NULL) {
              *q++ = *p++;
           }

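However, is it possible to write a faster version of this loop using intrinsics or assembly? If you have done something similar, could you post it here?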

Answer 1

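Unrelated to your actual question, but the code sample as shown cannot work. That is because you seem to have alignment traps active and are hitting one: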

       [ ... ]
       size_t offset = 1;
       char* src = malloc(sizeof(char)*(size + offset));
       [ ... ]
       NEONCopyPLD( dst, src + offset, size );

r7 : 00001017  r6 : 00e85010  r5 : 00e75009  r4 : 00010001
r3 : 000f4240  r2 : 00010000  r1 : 00e75009  r0 : 00e85010
                                   ^^^^^^^^

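You are using VLDM with an unaligned pointer: src + offset can never be aligned, because offset == 1.

The register dump supports this. Your NEON asm function never touches R5, so R5 still holds the value the C code put there before the call. Seeing R1 == R5 therefore means R1 had not changed when the trap was raised, which means VLDM R1!, ... did not complete even once. From that I conclude that you are running with alignment traps enabled and get the SIGSEGV on the very first VLDM.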
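
To make the alignment argument concrete, here is a minimal sketch in C that checks a pointer value the same way one would read it off the register dump. It assumes that VLDM needs a word-aligned (4-byte) address, which is my reading of the architecture documentation rather than something stated in the question. The helper names is_aligned and bytes_to_alignment are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment`
 * bytes. `alignment` must be a power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: number of bytes to skip so that
 * ptr + result is aligned to `alignment` (a power of two). */
static size_t bytes_to_alignment(const void *ptr, size_t alignment)
{
    return (size_t)(-(uintptr_t)ptr & (alignment - 1));
}
```

Applied to the dump, r1 = 0x00e75009 ends in 9 and is not a multiple of 4, while r0 = 0x00e85010 is. A copy routine that wants to use VLDM on arbitrary input would typically copy bytes_to_alignment(src, 4) bytes one at a time first, or use an instruction that tolerates unaligned addresses.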

Answer 2

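You never return from your assembler function. Whatever code happens to be stored after it will therefore be executed, and sooner or later that results in a crash.

To exit your function: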

mov pc, lr

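This will most likely fix your problem. You should also check which registers (NEON and general purpose) must be preserved across an assembler function call.

This page is a useful resource with examples of how to do this: http://omappedia.org/wiki/Writing_ARM_Assembly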
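
As a reference for what the assembly loop is meant to do, here is a rough C model of it, with the part the asm version is missing: the function stops when fewer than 64 bytes remain and then actually returns to its caller. This is only a sketch; copy_blocks_pld is a hypothetical name, __builtin_prefetch is the GCC/Clang builtin standing in for PLD, and the plain memcpy of 64 bytes stands in for the VLDM/VSTM pair.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C model of the intended loop, not the ARM routine itself. */
static void *copy_blocks_pld(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 64) {
        /* Counterpart of PLD [r1, #0xC0]: hint 192 bytes ahead,
         * but only while that address is still inside the source. */
        if (n >= 256)
            __builtin_prefetch(s + 192);

        memcpy(d, s, 64);   /* counterpart of VLDM/VSTM {d0-d7} */
        d += 64;
        s += 64;
        n -= 64;
    }

    /* Copy whatever is left when the size is not a multiple of 64. */
    if (n > 0)
        memcpy(d, s, n);

    return dst;             /* the step the asm version never performs */
}
```

In the .s file, the equivalent of the final return is a single instruction after the loop. Note also that the loop condition here is "at least one full block left", so a size that is an exact multiple of 64 does not copy an extra block.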

Answer 3

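You can google for "aosp bionic memcpy".

It is not perfect, but it is a fairly good implementation.

I suggest you start with memset though, because memcpy is much more complicated than you might think.

Analyze the bionic memset, try to understand the flow, and ask if you do not understand why the author did something in a particular way.

I also don't understand why you are talking about auto-vectorization, which is completely useless IMO.

Please do some research of your own first, then ask if you get stuck.

Answering this particular question would require a whole multi-chapter tutorial, starting with basic ARM instructions.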
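
To give an idea of the kind of flow to look for when reading an optimized memset, here is a minimal sketch in portable C. It is not bionic's code and makes no claim about how bionic does it; it only shows the usual three phases (unaligned head, wide aligned stores, leftover tail). The name memset_sketch and the 8-byte store width are assumptions for illustration; a NEON version would use wider stores in the middle phase.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the head / bulk / tail structure. */
static void *memset_sketch(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char byte = (unsigned char)value;

    /* 1. Head: single bytes until p is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = byte;
        n--;
    }

    /* 2. Bulk: replicate the byte into a 64-bit pattern and
     *    store 8 bytes at a time to the now aligned address. */
    uint64_t pattern = 0x0101010101010101ULL * byte;
    while (n >= 8) {
        memcpy(p, &pattern, 8);  /* avoids aliasing problems */
        p += 8;
        n -= 8;
    }

    /* 3. Tail: the remaining 0..7 bytes. */
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

The same skeleton carries over to memcpy, except that source and destination can have different alignments, which is a large part of why memcpy is harder.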

NEON memcpy, memset, and using a .s file together with a .c file
