Question

I am trying to convert this NEON code to intrinsics:

vld1.32                {d0}, [%[pInVertex1]]
flds                   s2, [%[pInVertex1], #8]

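This loads three 32-bit floats from the variable pInVertex1 into the d0 and d1 registers. I can't find an equivalent using intrinsics. There is vld1q_f32, but that only works for 4 floats. Does anyone know an efficient way of doing this (by which I mean without extra copying)?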

Answer 1

In AArch32, the only instructions that write exactly three 32-bit floats to registers are the load-multiple instructions:

r0 holds the address of the structure
FLDMIAS r0, {s0-s2}

This can be used in either VFP or NEON code.

I don't know of a corresponding intrinsic.

Answer 2

In DirectXMath, I implemented the ARM-NEON version of XMLoadFloat3 as:

float32x2_t x = vld1_f32( reinterpret_cast<const float*>(pSource) );
float32x2_t zero = vdup_n_f32(0);
float32x2_t y = vld1_lane_f32( reinterpret_cast<const float*>(pSource)+2, zero, 0 );
return vcombine_f32( x, y );
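
For anyone who wants to try the snippet above in isolation, here is a minimal self-contained sketch of the same load pattern. The names `Vec4`, `load3`, and `HAVE_NEON` are hypothetical and not part of DirectXMath; the scalar fallback is only there so the sketch also builds on non-ARM machines.

```cpp
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

// Hypothetical 4-lane result type, so the sketch also builds without NEON.
struct Vec4 { float v[4]; };

// Hypothetical helper: loads exactly three floats from p and sets lane 3 to 0.
inline Vec4 load3(const float* p) {
    assert(p != nullptr);
    Vec4 out;
#if HAVE_NEON
    float32x2_t xy   = vld1_f32(p);                   // reads p[0] and p[1]
    float32x2_t zero = vdup_n_f32(0.0f);              // {0, 0}
    float32x2_t z0   = vld1_lane_f32(p + 2, zero, 0); // {p[2], 0}
    vst1q_f32(out.v, vcombine_f32(xy, z0));           // {p[0], p[1], p[2], 0}
#else
    // Portable fallback with the same result, for non-NEON builds.
    out.v[0] = p[0];
    out.v[1] = p[1];
    out.v[2] = p[2];
    out.v[3] = 0.0f;
#endif
    return out;
}
```

The two loads together touch only 12 bytes, so nothing past the third float is read, and the fourth lane comes out as zero rather than whatever follows the structure in memory.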

How do you load 3 floats using NEON intrinsics?
