I was looking through the SIMD options to speed up a comparison and found the function `__m128d _mm_cmpgt_sd(__m128d a, __m128d b)`.
Apparently it compares the lower double-precision values, and copies the upper double from `a` into the output. What it does is clear enough, but what is the point? What problem is this meant to solve?
Answer 0 (score: 4)
The point is probably that on very old hardware, such as the Intel Pentium II and III, `_mm_cmpgt_sd()` was faster than `_mm_cmpgt_pd()`. See Agner Fog's instruction tables. These processors (PII and PIII) only had 64-bit-wide floating-point units, so 128-bit-wide SSE instructions executed as two 64-bit micro-ops on them. On newer CPUs, such as Intel Core 2 (Merom) and later, the `_pd` and `_ps` versions are as fast as the `_sd` and `_ss` versions. So if you only need to compare a single element and don't care about the upper 64 bits of the result, you may prefer the `_sd` and `_ss` versions.
Moreover, `_mm_cmpgt_pd()` may raise spurious floating-point exceptions, or suffer slowdowns, if the upper garbage bits happen to contain a NaN or a subnormal number; see Peter Cordes' answer. In practice, though, such garbage upper bits should be easy to avoid when programming with intrinsics.
If you want to vectorize your code and need packed double comparisons, use the intrinsic `_mm_cmpgt_pd()` instead of `_mm_cmpgt_sd()`.
Answer 1 (score: 3)
`cmpsd` is an instruction that exists in asm and operates on XMM registers, so it would be inconsistent not to expose it via intrinsics.
(Almost all packed-FP instructions (other than shuffles/blends) have a scalar version, so again there's a consistency argument for ISA design; it's just an extra prefix to the same opcode, and might require more transistors to special-case that opcode not supporting a scalar version.)
Whether or not you or the people designing the intrinsics API could think of a reasonable use-case is not at all the point. It would be foolish to leave things out on that basis; when someone comes up with a use-case they'll have to use inline asm or write C that compiles to more instructions.
Perhaps someone someday will find a use-case for a vector with a mask as the low half, and a still-valid `double` in the high half. e.g. maybe `_mm_and_pd` the result back onto the input to conditionally zero just the low element, without needing a packed-compare in the high element to produce true. Or consider that all-ones is a bit-pattern for NaN, and all-zero is the bit-pattern for `+0.0`.
IIRC, `cmppd` slows down if any of the elements are subnormal (if you don't have the DAZ bit set in MXCSR). At least on some older CPUs that existed when the ISA was being designed. So for FP compares, having scalar versions is (or was) essential for avoiding spurious FP assists for elements you don't care about.
Also for avoiding spurious FP exceptions (or setting exception flags if they're masked), like if there's a NaN in the upper element of either vector.
@wim also makes a good point that Intel CPUs before Core 2 decoded 128-bit SIMD instructions to 2 uops, one for each 64-bit half. So using `cmppd` when you don't need the high-half result would always be slower, even if it can't fault. Lots of multi-uop instructions can easily bottleneck the front-end decoders on CPUs without a uop cache, because only one of the decoders can handle them.
You don't normally use intrinsics for FP scalar instructions like `cmpsd` or `addsd`, but they exist in case you want them (e.g. as the last step in a horizontal sum). More often you just leave it to the compiler to use scalar versions of instructions when compiling scalar code without auto-vectorization.
And often for scalar compares, compilers will want the result in EFLAGS so will use `ucomisd` instead of creating a compare mask, but for branchless code a mask is often useful, e.g. for `a < b ? c : 0.0` with `cmpsd` and `andpd`. (Or really `andps`, because it's shorter and does the same thing as the pointless `andpd`.)