Question

I am learning assembly and using it inline with my Digital Mars C++ compiler. I searched for ways to make a program faster and found these parameters for tuning it:

Use a better C++ compiler (thinking of GCC or the Intel compiler).

Use assembly only in the critical parts of the program.

Find a better algorithm.

Cache miss, cache contention.

Loop-carried dependency chain.

Instruction fetching time.

Instruction decoding time.

Instruction retirement.

Register read stalls.

Execution port throughput.

Execution unit throughput.

Suboptimal reordering and scheduling of micro-ops.

Branch misprediction.

Floating point exception.

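I understand all of them except "register read stalls".

Question: Can anyone tell me how this happens in a CPU, and what the "superscalar" form of "out-of-order execution" is? Normal "out-of-order" seemed logical, but I could not find a logical explanation of the "superscalar" form.

Question 2: Can someone also point me to a good instruction list for SSE, SSE2, and newer CPUs, preferably with micro-op tables, port throughputs, execution units, and some latency tables, so I can find the real bottleneck of a piece of code?

I would be happy with a small example like this: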

//loop carried dependency chain breaking:
__asm
{
loop_begin:
....
.... 
sub edx,05h //rather than taking i*5 in each iteration, we sub 5 each iteration
sub ecx,01h //i-- counter
...
...
jnz loop_begin // edit: sub ecx must come after sub edx, so jnz tests the flags set by the counter
}
// sub edx not only gets rid of a multiplication, it also makes edx independent of ecx
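The same idea can be sketched in portable C++. This is only an illustration under assumptions: the function names and the choice of summing the values are made up for the example, and an optimizing compiler will often perform this strength reduction on its own.

```cpp
#include <cassert>
#include <cstdint>

// Baseline: recomputes i * 5 with a multiplication on every iteration,
// so the product depends on the loop counter each time around.
std::uint64_t sum_with_multiply(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint32_t i = n; i != 0; --i) {
        total += static_cast<std::uint64_t>(i) * 5;
    }
    return total;
}

// Strength-reduced: keeps a running value equal to i * 5 and subtracts 5
// per iteration, mirroring "sub edx,05h" next to "sub ecx,01h".
// (Hypothetical helper, not taken from the original post.)
std::uint64_t sum_with_running_offset(std::uint32_t n) {
    std::uint64_t total = 0;
    std::uint64_t offset = static_cast<std::uint64_t>(n) * 5;  // one multiply, outside the loop
    for (std::uint32_t i = n; i != 0; --i) {
        // Invariant check only; compiled out when NDEBUG is defined.
        assert(offset == static_cast<std::uint64_t>(i) * 5);
        total += offset;
        offset -= 5;  // independent of i: no multiplication, no use of the counter
    }
    return total;
}
```

Both functions return the same value, for instance 275 for n = 10; the second one simply never needs the counter to produce the scaled value.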

Thanks.

Computer: Pentium-M 2GHz, Windows XP 32-bit

Answer 1

You should take a look at Agner Fog's optimization manuals: Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms, or Optimizing subroutines in assembly language: An optimization guide for x86 platforms.

But to really be able to outsmart a modern compiler, you need some good background knowledge of the architecture you want to optimize for: The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers

Answer 2

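My two cents: the Intel Architecture Developers Manuals are really detailed, and they cover all the SSE instructions as well, with opcodes, instruction latency and throughput, and all the gory details you might need :)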

Answer 3

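"Superscalar" stalls are an added problem for scheduling instructions. A modern processor can not only execute instructions out of order, it can also execute 3-4 simple instructions at a time, using parallel execution units.

But to actually do so, the instructions must be sufficiently independent of each other. If, for example, one instruction uses the result of a previous instruction, it must wait for that result to become available.

In practice, this makes it extremely hard to create an optimal assembly program by hand. You really have to think like a computer (compiler) to work out the optimal order of the instructions. And if you change one instruction, you have to do it all over again...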
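The independence requirement described above can be illustrated with a small C++ sketch. The function names are hypothetical, and whether the second version is actually faster depends on the CPU and on what the compiler already does, so treat it as an illustration of the dependency structure rather than a measured result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every addition needs the result of the previous one,
// so the additions form a single serial dependency chain.
std::uint64_t sum_single_chain(const std::vector<std::uint32_t>& v) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        acc += v[i];
    }
    return acc;
}

// Two accumulators: the additions into acc0 and acc1 do not depend on each
// other, so a superscalar core is free to execute them in parallel.
std::uint64_t sum_two_chains(const std::vector<std::uint32_t>& v) {
    std::uint64_t acc0 = 0;
    std::uint64_t acc1 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += v[i];      // chain 0
        acc1 += v[i + 1];  // chain 1, independent of chain 0
    }
    assert(i + 1 >= n);    // at most one element is left over
    if (i < n) {
        acc0 += v[i];      // odd-length tail
    }
    return acc0 + acc1;    // the chains are joined only once, at the end
}
```

With integer data both functions return the same sum, for example 15 for {1, 2, 3, 4, 5}. With floating-point data the split changes the order of the additions, so the rounding of the result can differ slightly.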

Answer 4

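For question #1, I highly recommend Computer Architecture: A Quantitative Approach. It does a very good job of explaining the concepts in context, so you can see the big picture. The examples are also very useful for someone interested in optimizing code, because they always focus on prioritizing and improving the bottleneck.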
