I'm developing a native Android application that is meant to run on devices with ARMv7 processors. For various reasons I need to do some heavy computation on vectors (of shorts and/or floats). I implemented some assembly functions using NEON instructions to speed the computation up, and have already gained a factor of 1.5, which isn't bad. I'm wondering whether I can make these functions even faster.
So the question is: what changes can I make to improve these functions?
//add two float vectors.
//the result could be put in src1 instead of dst
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
    "1:                              \n"
    "    vld1.32 {q0}, [%[src1]]!    \n"
    "    vld1.32 {q1}, [%[src2]]!    \n"
    "    vadd.f32 q0, q0, q1         \n"
    "    subs %[count], %[count], #4 \n"
    "    vst1.32 {q0}, [%[dst]]!     \n"
    "    bgt 1b                      \n"
    // all four operands are modified by the loop, so they must be
    // read-write outputs ("+r"), and "subs" requires the "cc" clobber
    : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
    :
    : "memory", "cc", "q0", "q1"
    );
}
//multiply a float vector by a scalar.
//the result could be put in src1 instead of dst
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{
    asm volatile (
    "    vdup.32 q1, %[scalar]       \n"  // broadcast the scalar to all 4 lanes
    "2:                              \n"
    "    vld1.32 {q0}, [%[src1]]!    \n"
    "    vmul.f32 q0, q0, q1         \n"
    "    subs %[count], %[count], #4 \n"
    "    vst1.32 {q0}, [%[dst]]!     \n"
    "    bgt 2b                      \n"
    : [dst] "+r" (dst), [src1] "+r" (src1), [count] "+r" (count)
    : [scalar] "r" (scalar)
    : "memory", "cc", "q0", "q1"
    );
}
//add two short vectors -> no problem with value-range limits
//the result should be put in a dst different from src1 and src2
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{
    asm volatile (
    "3:                              \n"
    "    vld1.16 {q0}, [%[src1]]!    \n"
    "    vld1.16 {q1}, [%[src2]]!    \n"
    "    vadd.i16 q0, q0, q1         \n"
    "    subs %[count], %[count], #8 \n"
    "    vst1.16 {q0}, [%[dst]]!     \n"
    "    bgt 3b                      \n"
    : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
    :
    : "memory", "cc", "q0", "q1"
    );
}
//multiply a short vector by a float vector and put the result back into a short vector
//the result should be put in a dst different from src1
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
    asm volatile (
    "4:                              \n"
    "    vld1.16 {d0}, [%[src1]]!    \n"  // load 4 shorts
    "    vld1.32 {q1}, [%[src2]]!    \n"  // load 4 floats
    "    vmovl.s16 q0, d0            \n"  // widen shorts to 32-bit ints
    "    vcvt.f32.s32 q0, q0         \n"  // int -> float
    "    vmul.f32 q0, q0, q1         \n"
    "    vcvt.s32.f32 q0, q0         \n"  // float -> int, truncating toward zero
    "    vmovn.s32 d0, q0            \n"  // narrow back to 16 bits (no saturation)
    "    subs %[count], %[count], #4 \n"
    "    vst1.16 {d0}, [%[dst]]!     \n"
    "    bgt 4b                      \n"
    : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
    :
    : "memory", "cc", "d0", "q0", "q1"
    );
}
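Not part of the original post, but plain-C reference versions are handy for validating the assembly routines on-device; here are two sketches (hypothetical helper names; note that the float-to-short path truncates toward zero, matching vcvt.s32.f32):

```c
#include <stddef.h>

/* Scalar reference for add_float_vector_with_neon3; count is the
   number of elements, as in the NEON version. */
static void add_float_vector_ref(float *dst, const float *src1,
                                 const float *src2, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = src1[i] + src2[i];
}

/* Scalar reference for mul_short_vector_by_float_vector_with_neon3;
   the (short) cast truncates toward zero like vcvt.s32.f32, and like
   vmovn.s32 it does not saturate. */
static void mul_short_by_float_ref(short *dst, const short *src1,
                                   const float *src2, int count)
{
    for (int i = 0; i < count; i++)
        dst[i] = (short)(src1[i] * src2[i]);
}
```

Comparing the NEON output against these over random inputs is a quick sanity check before benchmarking.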
Thanks in advance!
Answer 0 (score: 1)
You can try unrolling the loop so that each iteration processes more elements.
The code in add_float_vector_with_neon3 takes 10 cycles per 4 elements (because of stalls), while unrolling to 16 elements takes 21 cycles. http://pulsar.webshaker.net/ccc/sample-34e5f701
There is some overhead, since you need to handle the remaining elements (or you can pad your data to a multiple of 16), but if you have a lot of data that overhead should be fairly small compared to the actual sums.
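A sketch of that remainder handling in plain C (hypothetical function name; the inner loop stands in for the 16-element NEON body, so the structure is portable):

```c
#include <stddef.h>

/* Process full blocks of 16 floats, then finish the leftover 0..15
   elements with a scalar tail loop. */
void add_float_vector_unrolled(float *dst, const float *src1,
                               const float *src2, int count)
{
    int main_count = count & ~15;          /* largest multiple of 16 <= count */
    for (int i = 0; i < main_count; i += 16) {
        /* the 16-element NEON-unrolled asm body would go here;
           scalar stand-in: */
        for (int j = 0; j < 16; j++)
            dst[i + j] = src1[i + j] + src2[i + j];
    }
    for (int i = main_count; i < count; i++)  /* scalar tail */
        dst[i] = src1[i] + src2[i];
}
```

The alternative mentioned above, padding the buffers to a multiple of 16 elements, removes the tail loop entirely at the cost of a little extra memory.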
Answer 1 (score: 0)
Here is an example of how to code it using NEON intrinsics.
The advantage is that you let the compiler optimize register allocation and instruction scheduling, while you constrain which instructions are used.
The drawback is that GCC doesn't seem able to fold the pointer arithmetic into the load/store instructions, so it emits extra ALU instructions for it. Or perhaps I'm wrong and GCC has a good reason for doing it this way.
Compiled with GCC and CFLAGS=-std=gnu11 -O3 -fgcse-lm -fgcse-sm -fgcse-las -fgcse-after-reload -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -fPIE -Wall, this code turns into very good object code. The loop is unrolled and interleaved to hide the long latency before the load results become available. And it is readable, too.
#include <stddef.h>
#include <arm_neon.h>

#define ASSUME_ALIGNED_FLOAT_128(ptr) ((float *)__builtin_assume_aligned((ptr), 16))

__attribute__((optimize("unroll-loops")))
void add_float_vector_with_neon3(float *restrict dst,
                                 const float *restrict src1,
                                 const float *restrict src2,
                                 size_t size)
{
    for (size_t i = 0; i < size; i += 4) {   // size_t index avoids a signed/unsigned mismatch
        float32x4_t inFloat41 = vld1q_f32(ASSUME_ALIGNED_FLOAT_128(src1));
        float32x4_t inFloat42 = vld1q_f32(ASSUME_ALIGNED_FLOAT_128(src2));
        float32x4_t outFloat4 = vaddq_f32(inFloat41, inFloat42);
        vst1q_f32(ASSUME_ALIGNED_FLOAT_128(dst), outFloat4);
        src1 += 4;
        src2 += 4;
        dst += 4;
    }
}
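One caveat: __builtin_assume_aligned is only a promise to the compiler; if a buffer is actually misaligned, the aligned loads may fault or silently misbehave. A minimal sketch of allocating buffers that keep that promise (assuming C11 aligned_alloc is available; posix_memalign or memalign are the usual alternatives on older Android NDKs):

```c
#include <stdlib.h>

/* Allocate a float buffer whose address is a multiple of 16 bytes,
   matching the ASSUME_ALIGNED_FLOAT_128 promise. Free with free(). */
float *alloc_float_vector_128(size_t n)
{
    /* aligned_alloc requires the size to be a multiple of the alignment,
       so round the byte count up to the next multiple of 16 */
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    return (float *)aligned_alloc(16, bytes);
}
```

Rounding the allocation up also pads the vector toward a full SIMD block, which plays well with the unrolling discussed above.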
Answer 2 (score: 0)
OK, I compared the code given in the initial post against the new function proposed by Josejulio:
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
    "1:                               \n"
    "    vld1.32 {q0,q1}, [%[src1]]!  \n"
    "    vld1.32 {q2,q3}, [%[src2]]!  \n"
    "    vadd.f32 q0, q0, q2          \n"
    "    vadd.f32 q1, q1, q3          \n"
    "    vld1.32 {q4,q5}, [%[src1]]!  \n"
    "    vld1.32 {q6,q7}, [%[src2]]!  \n"
    "    vadd.f32 q4, q4, q6          \n"
    "    vadd.f32 q5, q5, q7          \n"
    "    subs %[count], %[count], #16 \n"
    "    vst1.32 {q0,q1}, [%[dst]]!   \n"
    "    vst1.32 {q4,q5}, [%[dst]]!   \n"
    "    bgt 1b                       \n"
    // modified operands declared read-write, "cc" clobbered by "subs"
    : [dst] "+r" (dst), [src1] "+r" (src1), [src2] "+r" (src2), [count] "+r" (count)
    :
    : "memory", "cc", "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7"
    );
}
However, while the tool (pulsar.webshaker.net/ccc/index.php) shows a large difference in CPU cycles per float, I don't see much of a difference in the latency measurements:

median, firstQuartile, thirdQuartile, minVal, maxVal (micro-sec, 1000 measures)
Original: 3564, 3206, 5126, 1761, 12144
Unrolled: 3567, 3080, 4877, 3018, 11683

So I'm not sure the unrolling is that effective...
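The measurement harness behind these numbers isn't shown; as a hypothetical sketch, median/quartile figures like the ones above can be extracted from repeated timings with a nearest-rank pick (function names are made up for illustration):

```c
#include <stdlib.h>

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Return the k-th quartile (k = 1: first, 2: median, 3: third) of n
   timing samples, sorting the array in place. A simple nearest-rank
   index is good enough for 1000 samples. */
double quartile(double *samples, int n, int k)
{
    qsort(samples, n, sizeof(double), cmp_double);
    return samples[(k * n) / 4];
}
```

On-device, each sample would be the elapsed time of one call to add_float_vector_with_neon3, e.g. measured with clock_gettime(CLOCK_MONOTONIC, ...); with wall-clock noise of this magnitude, overlapping quartile ranges like those above are expected even when cycle counts differ.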