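I am trying to speed up some multiplications using inline ASM. The device is AArch64, and it has a vector multiply. The multiplier performs a 64 x 64 → 128-bit multiplication.
Here is what the multiply looks like when the output operand is a SIMD/vector register: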
uint64x2_t r = {0,0}, a = {2,4}, b = {6,8};
__asm__ __volatile__
(
"pmull %0.1q, %1.1d, %2.1d;"
: "=w" (r)
: "w" (a[0]), "w" (b[0])
: "cc"
);
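One thing worth keeping in mind when checking results: PMULL is a polynomial (carry-less) multiply, so partial products are combined with XOR rather than added with carries. The sketch below is a portable reference model of that 64 x 64 → 128-bit operation. The names `clmul128_t` and `clmul64_ref` are hypothetical helpers, not part of any library, and the loop is only meant for comparing against what the instruction returns.

```c
#include <stdint.h>

/* Hypothetical helper type: the two 64-bit halves of a 128-bit product. */
typedef struct {
    uint64_t lo;  /* bits 0..63   */
    uint64_t hi;  /* bits 64..127 */
} clmul128_t;

/* Hypothetical reference model of a 64 x 64 -> 128-bit carry-less multiply.
   For every set bit i of b, a shifted left by i is XORed into the result. */
clmul128_t clmul64_ref(uint64_t a, uint64_t b)
{
    clmul128_t r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)
                r.hi ^= a >> (64 - i);  /* bits shifted out of the low half */
        }
    }
    return r;
}
```

With the operands from the snippet above (2 and 6) the low half is 12 and the high half is 0, the same as an integer multiply, because no carries are involved. With 3 and 3 the model gives 5 rather than 9.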
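How do I remove the output operand "=w" (r) and change it to two D registers backed by variables? (I'm taking a guess at the syntax below, because I'm not sure what it should look like):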
uint64x2_t r = {0,0}, a = {2,4}, b = {6,8};
uint64x1_t r1, r2;
__asm__ __volatile__
(
"pmull %0.1q, %1.1d, %2.1d;"
: "=w.1d" (r1), "=w.1d" (r2),
: "w" (a[0]), "w" (b[0])
: "cc"
);
Or:
uint64x2_t r = {0,0}, a = {2,4}, b = {6,8};
uint64_t r1, r2;
__asm__ __volatile__
(
"pmull %0.1q, %1.1d, %2.1d;"
: "=w.1d" (r1), "=w.1d" (r2),
: "w" (a[0]), "w" (b[0])
: "cc"
);
The idea is that after the multiplication I have two 64-bit values, r1 and r2, and I don't need to do anything else in order to use them.
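One possible way to get there, sketched under some assumptions: PMULL writes its 128-bit result to a single vector register, so the sketch keeps one "=w" output and splits it into two 64-bit values afterwards with `vgetq_lane_u64` from `arm_neon.h`. The function name `pmull_split` is hypothetical. The code assumes GCC or Clang, a little-endian target, and that the 64-bit PMULL form is enabled (for example with `-march=armv8-a+crypto`). The `#else` branch is only a software stand-in so the snippet also builds on other targets.

```c
#include <stdint.h>

#if defined(__aarch64__) && \
    (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>

/* Hypothetical helper: one 128-bit vector output, then two lane reads. */
void pmull_split(uint64_t a, uint64_t b, uint64_t *r1, uint64_t *r2)
{
    uint64x2_t r;
    /* PMULL does not touch the flags, so no "cc" clobber is listed. */
    __asm__ ("pmull %0.1q, %1.1d, %2.1d"
             : "=w" (r)
             : "w" (a), "w" (b));
    *r1 = vgetq_lane_u64(r, 0);  /* low 64 bits of the product  */
    *r2 = vgetq_lane_u64(r, 1);  /* high 64 bits of the product */
}

#else

/* Software stand-in with the same carry-less (XOR) semantics. */
void pmull_split(uint64_t a, uint64_t b, uint64_t *r1, uint64_t *r2)
{
    uint64_t lo = 0, hi = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            lo ^= a << i;
            if (i != 0)
                hi ^= a >> (64 - i);
        }
    }
    *r1 = lo;
    *r2 = hi;
}

#endif
```

Calling `pmull_split(2, 6, &r1, &r2)` should leave 12 in `r1` and 0 in `r2`. If inline assembly is not a hard requirement, the ACLE intrinsic `vmull_p64` maps to the same instruction and lets the compiler handle register allocation.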