Eventual ARM Linux memory fragmentation with NEON copy but not memcpy

Time: 2018-04-11 16:48:44

Tags: c++ linux arm memcpy neon

I'm running Linux 4.4 on a BeagleBone X-15 (ARM Cortex-A15) board. My application mmaps the output of the SGX GPU and needs to copy the DRM backing store.

Both memcpy and my custom NEON copy code work, but the NEON code is much faster (~11 ms vs. 35 ms).

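I've noticed that, quite consistently, after about 12500 seconds of running with the NEON version of the copy, Linux kills the application for being out of memory (OOM). When I change that one line from the NEON copy to the standard memcpy and run the application again, it runs indefinitely (12 hours so far...), but the copy is slower.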

I've pasted the mmap, copy, and NEON copy code below. Is there anything wrong with my NEON copy? Thanks.

NEON copy:

/**
* CompOpenGL neonCopyRGBAtoRGBA()
* Purpose: neonCopyRGBAtoRGBA - Software NEON copy
*
* @param src - Source buffer
* @param dst - Destination buffer
* @param numpix - Number of pixels to convert
*/
__attribute__((noinline)) void CompOpenGL::neonCopyRGBAtoRGBA(unsigned char* src, unsigned char* dst, int numPix)
{

    (void)src;
    (void)dst;
    (void)numPix;

    // This case takes RGBA -> BGRA
    __asm__ volatile(
                "mov r3, r3, lsr #3\n"           /* Divide number of pixels by 8 because we process them 8 at a time */
                "loopRGBACopy:\n"
                "vld4.8 {d0-d3}, [r1]!\n"        /* Load 8 pixels into d0 through d3. d0 = R[0-7], d1 = G[0-7], d2 = B[0-7], d3 = A[0-7] */
                "subs r3, r3, #1\n"              /* Decrement the loop counter */
                "vst4.8 {d0-d3}, [r2]!\n"        /* Store the RGBA into destination 8 pixels at a time */
                "bgt loopRGBACopy\n"
                "bx lr\n"
                );

}

1 answer:

Answer 0 (score: 1)

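The good news is that your function will be considerably faster if you replace vld4/vst4 with vld1/vst1. vld4 de-interleaves the four channels on load and vst4 re-interleaves them on store, which a straight copy does not need.

The bad news is that you have to declare the registers you use and modify, including CPSR and memory, and you shouldn't return from inside inline assembly (the bx lr), because that bypasses the function epilogue the compiler generated.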

__asm__ volatile(
                "mov r3, r3, lsr #3\n"           /* Divide number of pixels by 8 because we process them 8 at a time */
                "loopRGBACopy:\n"
                "vld1.8 {d0-d3}, [r1]!\n"        /* Load 32 bytes (8 RGBA pixels) into d0 through d3, without de-interleaving */
                "subs r3, r3, #1\n"              /* Decrement the loop counter */
                "vst1.8 {d0-d3}, [r2]!\n"        /* Store the RGBA into destination 8 pixels at a time */
                "bgt loopRGBACopy\n"
                ::: "r1", "r2", "r3", "d0", "d1", "d2", "d3", "cc", "memory"
                );
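
Note that the block above still assumes the compiler left src, dst and numPix in r1, r2 and r3, which a clobber list on its own does not guarantee. One way around that, sketched here assuming GCC or Clang extended asm on 32-bit ARM with NEON enabled, is to pass the three values as read-write operands and let the compiler choose the registers. The free function name copyRGBAAsm is hypothetical (it is not in the original code), and the memcpy fallback is only there so the sketch also builds on other targets:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical free-function variant of neonCopyRGBAtoRGBA (the name is an assumption).
// Copies numPix RGBA pixels (4 bytes each) from src to dst.
void copyRGBAAsm(const std::uint8_t* src, std::uint8_t* dst, std::size_t numPix)
{
    assert(src != nullptr && dst != nullptr);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    std::size_t blocks = numPix / 8;   // 8 pixels (32 bytes) per loop iteration
    if (blocks != 0) {
        __asm__ volatile(
            "1:\n"                               // numeric local label, safe if the asm is duplicated
            "vld1.8 {d0-d3}, [%[src]]!\n"        // load 32 bytes, post-increment src
            "subs %[n], %[n], #1\n"              // decrement the block counter, set flags
            "vst1.8 {d0-d3}, [%[dst]]!\n"        // store 32 bytes, post-increment dst
            "bgt 1b\n"                           // loop while blocks remain
            : [src] "+r"(src), [dst] "+r"(dst), [n] "+r"(blocks)  // read and modified by the asm
            :
            : "d0", "d1", "d2", "d3", "cc", "memory");
    }
    // src and dst were advanced by the loop; copy the remaining 0 to 7 pixels.
    std::memcpy(dst, src, (numPix % 8) * 4);
#else
    // Portable fallback for targets without 32-bit ARM NEON.
    std::memcpy(dst, src, numPix * 4);
#endif
}
```

There is no bx lr here: control simply falls out of the asm block, so the compiler's own epilogue runs as usual.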

http://www.ethernut.de/en/documents/arm-inline-asm.html
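
If inline assembly is not a hard requirement, the same vld1/vst1 style copy can also be written with NEON intrinsics, which leaves register allocation and the function epilogue entirely to the compiler. This is only a sketch under assumptions: copyRGBAIntrinsics is a hypothetical free function rather than the CompOpenGL member from the question, and whether it matches the hand-written loop's speed on the Cortex-A15 has not been measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical intrinsics-based copy (the name is an assumption, not from the question).
// Copies numPix RGBA pixels (4 bytes each) from src to dst.
void copyRGBAIntrinsics(const std::uint8_t* src, std::uint8_t* dst, std::size_t numPix)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t bytes = numPix * 4;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // 32 bytes (8 pixels) per iteration, the same granularity as the asm loop.
    while (bytes >= 32) {
        uint8x16_t lo = vld1q_u8(src);        // plain load, no de-interleaving
        uint8x16_t hi = vld1q_u8(src + 16);
        vst1q_u8(dst, lo);
        vst1q_u8(dst + 16, hi);
        src += 32;
        dst += 32;
        bytes -= 32;
    }
#endif
    // Remaining tail bytes, or the whole buffer on targets without NEON.
    std::memcpy(dst, src, bytes);
}
```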