I was looking through the SIMD options to speed up a comparison and found the function `__m128d _mm_cmpgt_sd(__m128d a, __m128d b)`.
Apparently it compares the lower double-precision values, and copies the upper double from `a` into the output. What it does is clear enough, but what is the point? What problem is this meant to solve?
Answer 0 (score: 4)
The point is probably that on very old hardware, such as the Intel Pentium II and III, `_mm_cmpgt_sd()` was faster than `_mm_cmpgt_pd()`. See Agner Fog's instruction tables. These processors (PII and PIII) only had 64-bit-wide floating-point units, so 128-bit-wide SSE instructions executed as two 64-bit micro-ops on them. On newer CPUs, such as Intel Core 2 (Merom) and later, the `_pd` and `_ps` versions are as fast as the `_sd` and `_ss` versions. So if you only need to compare a single element and don't care about the upper 64 bits of the result, you may prefer the `_sd` and `_ss` versions.
Moreover, `_mm_cmpgt_pd()` may raise spurious floating-point exceptions, or suffer slowdowns, if the upper garbage bits happen to contain a NaN or a subnormal number; see Peter Cordes' answer. In practice, though, such garbage upper bits should be easy to avoid when programming with intrinsics.
If you want to vectorize your code and need packed double comparisons, use the intrinsic `_mm_cmpgt_pd()` instead of `_mm_cmpgt_sd()`.
Answer 1 (score: 3)
`cmpsd` is an instruction that exists in asm and operates on XMM registers, so it would be inconsistent not to expose it via intrinsics.
(Almost all packed-FP instructions (other than shuffles/blends) have a scalar version, so again there's a consistency argument for ISA design; it's just an extra prefix to the same opcode, and might require more transistors to special-case that opcode not supporting a scalar version.)
Whether or not you or the people designing the intrinsics API could think of a reasonable use-case is not at all the point. It would be foolish to leave things out on that basis; when someone comes up with a use-case they'll have to use inline asm or write C that compiles to more instructions.
Perhaps someone someday will find a use-case for a vector with a mask as the low half, and a still-valid `double` in the high half. e.g. maybe `_mm_and_pd` the result back onto the input to conditionally zero just the low element, without needing a packed-compare in the high element to produce true. Or consider that all-ones is a bit-pattern for NaN, and all-zero is the bit-pattern for `+0.0`.
IIRC, `cmppd` slows down if any of the elements are subnormal (if you don't have the DAZ bit set in MXCSR). At least on some older CPUs that existed when the ISA was being designed. So for FP compares, having scalar versions is (or was) essential for avoiding spurious FP assists for elements you don't care about.
Also for avoiding spurious FP exceptions (or setting exception flags if they're masked), like if there's a NaN in the upper element of either vector.
@wim also makes a good point that Intel CPUs before Core 2 decoded 128-bit SIMD instructions to 2 uops, one for each 64-bit half. So using `cmppd` when you don't need the high-half result would always be slower, even if it can't fault. Lots of multi-uop instructions can easily bottleneck the front-end decoders on CPUs without a uop cache, because only one of the decoders can handle them.
You don't normally use intrinsics for FP scalar instructions like `cmpsd` or `addsd`, but they exist in case you want them (e.g. as the last step in a horizontal sum). More often you just leave it to the compiler to use scalar versions of instructions when compiling scalar code without auto-vectorization.
And often for scalar compares, compilers will want the result in EFLAGS so will use `ucomisd` instead of creating a compare mask, but for branchless code a mask is often useful, e.g. for `a < b ? c : 0.0` with `cmpsd` and `andpd`. (Or really `andps`, because it's shorter and does the same thing as the pointless `andpd`.)