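Add all elements in a lane

Time: 2012-08-29 04:55:16

Tags: c arm simd neon

Is there an intrinsic that adds up all of the elements (lanes) of a vector? I am using NEON to multiply 8 numbers, and I need to sum the results. Here is some paraphrased code showing what I am currently doing (it can probably be optimised):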

int16_t p[8], q[8], r[8];
int32_t sum;
int16x8_t pneon, qneon, result;

p[0] = some_number;
p[1] = some_other_number; 
//etc etc
pneon = vld1q_s16(p);

q[0] = some_other_other_number;
q[1] = some_other_other_other_number;
//etc etc
qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon); // multiply the loaded vectors, not the raw arrays
vst1q_s16(r,result);
sum = ((int32_t) r[0] + (int32_t) r[1] + ... //etc );

Is there a "better" way to do this?

3 answers:

Answer 0: (score: 4)

If you are targeting the newer 64-bit ARM architecture, then ADDV is the right instruction for you.

Here is how your code would look with it.

qneon = vld1q_s16(q);
result = vmulq_s16(pneon, qneon);
sum = vaddvq_s16(result);

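That's it. A single instruction sums all of the lanes in the vector register.

Sadly, this instruction is not available on the older 32-bit ARM architecture.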
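One caveat: `vaddvq_s16` returns an `int16_t`, so the total can wrap when the eight lanes add up to more than 16 bits can hold, whereas the question accumulates into an `int32_t`. A minimal sketch of a widening alternative, assuming an AArch64 target, uses `vaddlvq_s16` (SADDLV), which returns an `int32_t`. The helper name `sum_lanes_s16` and the scalar fallback for other targets are illustrative assumptions, not part of the original answer.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums eight int16_t lanes into a 32-bit total. */
int32_t sum_lanes_s16(const int16_t r[8])
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Load the eight lanes and reduce them with a widening across-vector add. */
    const int16x8_t v = vld1q_s16(r);
    return vaddlvq_s16(v);
#else
    /* Assumed portable fallback so the sketch also builds on non-AArch64 targets. */
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += r[i];
    return sum;
#endif
}
```

In the question's code this would replace the `vst1q_s16` store and the manual scalar sum, for example by calling `vaddlvq_s16(result)` directly on the product vector.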

Answer 1: (score: 0)

Something like this should be close to optimal (caution: untested):

const int16x4_t result_low = vget_low_s16(result); // Extract low 4 elements
const int16x4_t result_high = vget_high_s16(result); // Extract high 4 elements
const int32x4_t twopartsum = vaddl_s16(result_low, result_high); // Extend to 32 bits and add (4 partial 32-bit sums are formed)
const int32x2_t twopartsum_low = vget_low_s32(twopartsum); // Extract 2 low 32-bit partial sums
const int32x2_t twopartsum_high = vget_high_s32(twopartsum); // Extract 2 high 32-bit partial sums
const int32x2_t fourpartsum = vadd_s32(twopartsum_low, twopartsum_high); // Add partial sums (2 partial 32-bit sums are formed)
const int32x2_t eightpartsum = vpadd_s32(fourpartsum, fourpartsum); // Final reduction
const int32_t sum = vget_lane_s32(eightpartsum, 0); // Move to general-purpose registers
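The same reduction can be sketched as a standalone function that also builds for 32-bit ARM. This variant swaps the `vget_low`/`vget_high`/`vaddl_s16` opening for a single `vpaddlq_s16` (pairwise add long), which is a different first step from the answer above but produces the same total. The helper name `hsum_s16_pairwise` and the scalar fallback for non-NEON builds are assumptions made for illustration.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: horizontal sum of eight int16_t values, widened to 32 bits. */
int32_t hsum_s16_pairwise(const int16_t r[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const int16x8_t v = vld1q_s16(r);
    /* Add adjacent 16-bit lanes into four 32-bit partial sums. */
    const int32x4_t sum4 = vpaddlq_s16(v);
    /* Fold the upper half onto the lower half: two 32-bit partial sums remain. */
    const int32x2_t sum2 = vadd_s32(vget_low_s32(sum4), vget_high_s32(sum4));
    /* The pairwise add leaves the full total in both lanes; read lane 0. */
    return vget_lane_s32(vpadd_s32(sum2, sum2), 0);
#else
    /* Assumed portable fallback so the sketch also builds without NEON. */
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += r[i];
    return sum;
#endif
}
```

Because every step after the load works on 32-bit lanes, the total cannot wrap at 16 bits, which matches the `int32_t` accumulation in the question.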

Answer 2: (score: 0)

variance_temp = vadd_f32(vget_high_f32(variance_n), vget_low_f32(variance_n));
sum  = vget_lane_f32(vpadd_f32(variance_temp, variance_temp), 0);
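
This answer was posted without explanation. It appears to apply the same fold-then-pairwise-add reduction to four float lanes. A minimal sketch follows, assuming `variance_n` is a `float32x4_t`. The helper name `hsum_f32` and the scalar fallback are illustrative assumptions.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: horizontal sum of four floats. */
float hsum_f32(const float x[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t v = vld1q_f32(x);
    /* Fold the high half onto the low half: lanes hold x[2]+x[0] and x[3]+x[1]. */
    const float32x2_t half = vadd_f32(vget_high_f32(v), vget_low_f32(v));
    /* Pairwise add puts the full total in both lanes; read lane 0. */
    return vget_lane_f32(vpadd_f32(half, half), 0);
#else
    /* Assumed portable fallback that mirrors the same order of additions. */
    return (x[2] + x[0]) + (x[3] + x[1]);
#endif
}
```

Floating-point addition is not associative, so this ordering can differ in the last bits from a plain left-to-right loop.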