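The environment is x86_64 on Linux. I am trying to speed up some calculations in the function foo by writing inline assembly. Instead of doing the calculation on 4 chars separately, I want to pass the 4 chars into one 32-bit register and do the calculation once. The variables a, b, and c are pointers to 4 chars.
Here is working code that gives me the correct result, but it is much slower than the original C code. It handles each of the four chars separately.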
__asm__(
"lea (%%esi, %%eax, 2), %%eax \n\t"
//... more calculations for a.x and b.x
: "=g" (a.x), "=g" (b.x)
: "r" (b.x), "r" (a.x), "r" (c.x)
:"%edx"
);
__asm__(
"lea (%%esi, %%eax, 2), %%eax \n\t"
//... more calculations for a.y and b.y
: "=g" (a.y), "=g" (b.y)
: "r" (b.y), "r" (a.y), "r" (c.y)
:"%edx"
);
__asm__(
"lea (%%esi, %%eax, 2), %%eax \n\t"
//... more calculations for a.z and b.z
: "=g" (a.z), "=g" (b.z)
: "r" (b.z), "r" (a.z), "r" (c.z)
:"%edx"
);
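For comparison, here is a minimal sketch of a single lea written with numbered operands, so that the compiler chooses the registers instead of the asm string hard-coding %esi and %eax. The function name lea_x_plus_2y and the computation x + 2*y are assumptions made for illustration (the real calculations above are elided), and the plain C fallback is only there so the sketch compiles on non-x86 targets.

```c
#include <stdint.h>

/* Hypothetical helper: returns x + 2*y (mod 2^32) using one lea.
 * %0, %1, %2 refer to the operands in the order they are listed,
 * so the asm never assumes that a value lives in a particular register. */
static inline uint32_t lea_x_plus_2y(uint32_t x, uint32_t y)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t out;
    __asm__("leal (%1, %2, 2), %0"
            : "=r"(out)          /* output: any general-purpose register */
            : "r"(x), "r"(y));   /* inputs: any general-purpose registers */
    return out;
#else
    return x + 2u * y;           /* portable fallback, same arithmetic */
#endif
}
```

With "r" constraints the compiler is free to pick any register for each operand, which is why referring to %%esi or %%eax by name inside the string only works by accident.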
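Instead of passing the chars individually, I then tried passing a, b, and c directly. I did not get a segmentation fault, but the outer function got completely messed up (not because of overflow). I need a better understanding of pipelining and of how x86_64 works in order to improve performance. Any suggestions on material I should read?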
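To make the "4 chars in one 32-bit register" idea concrete, here is a minimal plain C sketch with no inline assembly. It assumes the per-char calculation is x + 2*y with wraparound modulo 256, which is only a guess based on the lea shown above; the names load4, store4, add_bytes and add_2x_bytes are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Load four chars into one 32-bit word. memcpy avoids alignment and
 * aliasing problems and normally compiles to a single 32-bit load. */
static inline uint32_t load4(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Store a 32-bit word back as four chars. */
static inline void store4(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Add the four bytes of x and y lane by lane (mod 256 in each lane).
 * The low 7 bits of each byte are added normally; the top bit is
 * handled with XOR so that no carry leaks into the neighbouring byte. */
static inline uint32_t add_bytes(uint32_t x, uint32_t y)
{
    const uint32_t H = 0x80808080u;
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* Per byte: x + 2*y (mod 256). The mask drops the bit that the shift
 * would otherwise push from one byte into the next. */
static inline uint32_t add_2x_bytes(uint32_t x, uint32_t y)
{
    uint32_t y2 = (y << 1) & 0xFEFEFEFEu;
    return add_bytes(x, y2);
}
```

A call such as store4(a, add_2x_bytes(load4(a), load4(b))) would then update all four chars of a in one pass. A plain 32-bit add or lea on the packed word would not give the same result, because carries cross from one char into the next.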