Question

Hello,

So, I'm optimizing some functions that I wrote for a simple operating system I'm developing. This function, putpixel(), currently looks like this (in case my assembly is unclear or wrong):

uint32_t loc = (x*pixel_w)+(y*pitch);
vidmem[loc] = color & 255;
vidmem[loc+1] = (color >> 8) & 255;
vidmem[loc+2] = (color >> 16) & 255;

This takes a little bit of explanation. First, loc is the pixel index I want to write to in video memory. X and Y coordinates are passed to the function. Then, we multiply X by the pixel width in bytes (in this case, 3) and Y by the number of bytes in each line. More information can be found here.

vidmem is a global variable, a uint8_t pointer to video memory.

That being said, anyone familiar with bitwise operations should be able to figure out how putpixel() works fairly easily.
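To make the offset arithmetic concrete, here is a minimal sketch of the C version. The function signature, the 16-pixel-wide framebuffer, and the names PIXEL_W, PITCH, framebuffer, and pixel_offset are assumptions made for illustration; only the body of putpixel() comes from the question.

```c
#include <stdint.h>

/* Hypothetical framebuffer geometry, chosen only for illustration. */
enum { PIXEL_W = 3, PITCH = 16 * 3 };     /* 3 bytes per pixel, 16 pixels per line */

static uint8_t framebuffer[PITCH * 4];     /* stand-in for real video memory */
static uint8_t *vidmem = framebuffer;      /* global pointer, as in the question */

/* Byte offset of pixel (x, y): x scaled by the pixel width, y by the line pitch. */
static uint32_t pixel_offset(uint32_t x, uint32_t y)
{
    return x * PIXEL_W + y * PITCH;
}

/* Store the low 24 bits of color one byte at a time, lowest byte first. */
static void putpixel(uint32_t x, uint32_t y, uint32_t color)
{
    uint32_t loc = pixel_offset(x, y);
    vidmem[loc]     = color & 255;
    vidmem[loc + 1] = (color >> 8) & 255;
    vidmem[loc + 2] = (color >> 16) & 255;
}
```

With these assumed dimensions, pixel (2, 1) lands at byte offset 2*3 + 1*48 = 54.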

Now, here's my assembly. Note that it has not been tested and may even be slower or just plain not work. This question is about how to make it compile.

I've replaced everything after the definition of loc with this:

__asm(
    "push %%rdi;"
    "push %%rbx;"
    "mov %0, %%rdi;"
    "lea %1, %%rbx;"
    "add %%rbx, %%rdi;"
    "pop %%rbx;"
    "mov %2, %%rax;"
    "stosb;"
    "shr $8, %%rax;"
    "stosb;"
    "shr $8, %%rax;"
    "stosb;"
    "pop %%rdi;"
    : : "r"(loc), "r"(vidmem), "r"(color)
);

When I compile this, clang gives me this error for every push instruction: unknown use of instruction mnemonic without a size suffix

So when I saw that error, I assumed it had to do with my omission of the GAS suffixes (which should have been implicitly decided on, anyway). But when I added the "l" suffix (all of my variables are uint32_ts), I got the same error! I'm not quite sure what's causing it, and any help would be much appreciated. Thanks in advance!

Answer 1

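You could make the compiler output for your C version much more efficient by loading vidmem into a local variable before the stores. As written, the compiler can't assume the stores don't alias vidmem, so it reloads the pointer before every byte store. Hrm, that does let gcc 4.9.2 avoid reloading vidmem, but it still generates some nasty code. clang 3.5 does slightly better.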
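A minimal sketch of the local-copy idea follows. The name putpixel_local and the fake_vram buffer are made up for illustration. The reasoning is that uint8_t is typically a character type, which may alias any object, including the global pointer itself, so copying the pointer into a local lets the compiler read the global only once.

```c
#include <stdint.h>

static uint8_t fake_vram[64];      /* hypothetical stand-in for video memory */
uint8_t *vidmem = fake_vram;       /* global pointer, as in the question */

void putpixel_local(uint32_t color, uint32_t loc)
{
    uint8_t *vmem = vidmem;        /* read the global once; the stores below
                                      cannot change this local copy */
    vmem[loc]     = color & 255;
    vmem[loc + 1] = (color >> 8) & 255;
    vmem[loc + 2] = (color >> 16) & 255;
}
```

Whether the three byte stores are merged into wider stores still depends on the compiler and version.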

Implementing what I said in my comment on your answer (that stos is 3 uops, vs. 1 for mov):

#include <stdint.h>

extern uint8_t *vidmem;
void putpixel_asm_peter(uint32_t color, uint32_t loc)
{
    // uint32_t loc  = (x*pixel_w)+(y*pitch);
    __asm(  "\n"
        "\t movb %b[col], (%[ptr])\n"
        "\t shrl $8, %[col];\n"
        "\t movw %w[col], 1(%[ptr]);\n"
        : [col] "+r" (color),  "=m" (vidmem[loc])
        : [ptr] "r" (vidmem+loc)
        :
        );
}

This compiles to a very efficient implementation:

gcc -O3 -S -o- putpixel.c 2>&1 | less  # (with extra lines removed)

putpixel_asm_peter:
        movl    %esi, %esi
        addq    vidmem(%rip), %rsi
#APP
        movb %dil, (%rsi)
        shrl $8, %edi;
        movw %di, 1(%rsi);
#NO_APP
        ret

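All of those instructions decode to a single uop on Intel CPUs. (The stores can micro-fuse, because they use a single-register addressing mode.) The movl %esi, %esi zeroes the upper 32 bits, because the caller might have generated that function arg with a 64-bit instruction that left garbage in the high 32 bits of %rsi. Your version could have saved some instructions by using constraints to ask for the values in the desired registers in the first place, but this will still be faster than stos.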

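Also notice how I let the compiler take care of adding loc to vidmem. You could have done it more efficiently with a lea, combining an add with a move. However, if the compiler wants to get clever when this is used in a loop, it can increment the pointer instead of recomputing the address. Finally, this means the same code will work in both 32-bit and 64-bit mode. %[ptr] will be a 64-bit register in 64-bit mode, but a 32-bit register in 32-bit mode. Since I don't have to do any math on it, it Just Works.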

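I used the =m output constraint to tell the compiler where in memory we are writing. (I should have cast the pointer to a struct { char a[3]; } or something similar, to tell gcc how much memory is actually written, as per the tip at the end of the "Clobbers" section in the gcc manual.)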
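A sketch of that struct cast, under some assumptions: the type name pixel3, the function name putpixel_asm_sized, and the fake_vram buffer are hypothetical, and the asm path is only compiled for GNU-compatible compilers targeting x86-64, with a plain C fallback elsewhere so the example builds anywhere.

```c
#include <stdint.h>

static uint8_t fake_vram[64];      /* hypothetical stand-in for video memory */
uint8_t *vidmem = fake_vram;

struct pixel3 { uint8_t b[3]; };   /* hypothetical type: names exactly 3 bytes */

void putpixel_asm_sized(uint32_t color, uint32_t loc)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("movb %b[col], (%[ptr])\n\t"      /* low byte to ptr[0] */
            "shrl $8, %[col]\n\t"
            "movw %w[col], 1(%[ptr])"         /* next two bytes to ptr[1..2] */
            : [col] "+r" (color),
              /* the cast tells the compiler all 3 bytes are written */
              "=m" (*(struct pixel3 *)(vidmem + loc))
            : [ptr] "r" (vidmem + loc));
#else
    /* Portable fallback with the same effect. */
    uint8_t *p = vidmem + loc;
    p[0] = color & 255;
    p[1] = (color >> 8) & 255;
    p[2] = (color >> 16) & 255;
#endif
}
```

The memory operand is never referenced in the template; it exists only to describe the written region to the compiler.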

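I also used color as an input/output constraint to tell the compiler that we modify it. If this were inlined, and later code expected to still find the value of color in the register, we'd have a problem. Having this in a function means color is already a temporary copy of the caller's value, so the compiler will know it needs to throw away the old color. Calling this in a loop could be slightly more efficient with two read-only inputs: one for color, one for color >> 8.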
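The two-read-only-inputs variant might look roughly like the sketch below. It is not taken from the answer: the names putpixel_asm_ro, pixel3, and fake_vram are hypothetical, and the asm path is again limited to GNU-compatible compilers on x86-64, with a C fallback elsewhere.

```c
#include <stdint.h>

static uint8_t fake_vram[64];      /* hypothetical stand-in for video memory */
uint8_t *vidmem = fake_vram;

struct pixel3 { uint8_t b[3]; };   /* hypothetical 3-byte type for the "=m" operand */

void putpixel_asm_ro(uint32_t color, uint32_t loc)
{
#if defined(__GNUC__) && defined(__x86_64__)
    /* No register is modified: the shift happens in C, so in a loop with a
       constant color the compiler is free to hoist it out. */
    __asm__("movb %b[lo], (%[ptr])\n\t"
            "movw %w[hi], 1(%[ptr])"
            : "=m" (*(struct pixel3 *)(vidmem + loc))
            : [lo] "r" (color),
              [hi] "r" (color >> 8),
              [ptr] "r" (vidmem + loc));
#else
    /* Portable fallback with the same effect. */
    uint8_t *p = vidmem + loc;
    p[0] = color & 255;
    p[1] = (color >> 8) & 255;
    p[2] = (color >> 16) & 255;
#endif
}
```

The trade-off is one extra input register in exchange for dropping the shrl from the asm statement.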

Note that I could have written the constraints as

    : [col] "+r" (color), [memref] "=m" (vidmem[loc])
    :
    :

But using %[memref] and 1 %[memref] to generate the desired addresses would lead gcc to emit

    movl    %esi, %esi
    movq    vidmem(%rip), %rax
# APP
    movb %edi, (%rax,%rsi)
    shrl $8, %edi;
    movw %edi, 1 (%rax,%rsi);

The two-register addressing mode means the store instructions can't micro-fuse (on Sandybridge and later, at least).

You don't even need inline asm to get decent code, though:

void putpixel_cast(uint32_t color, uint32_t loc)
{
    // uint32_t loc  = (x*pixel_w)+(y*pitch);
    typeof(vidmem) vmem = vidmem;
    vmem[loc]   = color & 255;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    *(uint16_t *)(vmem+loc+1) = color >> 8;
#else
    vmem[loc+1] = (color >> 8) & 255; // gcc sucks at optimizing this for little endian :(
    vmem[loc+2] = (color >> 16) & 255;
#endif
}

This compiles to (gcc 4.9.2 and clang 3.5 give the same output):

    movq    vidmem(%rip), %rax
    movl    %esi, %esi
    movb    %dil, (%rax,%rsi)
    shrl    $8, %edi
    movw    %di, 1(%rax,%rsi)
    ret

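This is only slightly less efficient than what we get with inline asm, and it should be easier for the optimizer to handle if inlined into a loop.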

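Overall performance

Calling this in a loop is probably a mistake. It will be more efficient to combine multiple pixels in a register (especially a vector register) and then write them all at once. Or, do 4-byte writes, each overlapping the last byte of the previous write, until you get to the end and have to preserve the byte after the last 3-byte chunk.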
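As a rough sketch of the overlapping 4-byte write idea (the function name fill_row24 and its signature are hypothetical), each pixel except the last is written as 4 bytes. The extra byte lands on the first byte of the next pixel, which the next write covers anyway. The last pixel is written as exactly 3 bytes so the byte after the row is preserved.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n consecutive 3-byte pixels starting at dst with the low 24 bits of color. */
void fill_row24(uint8_t *dst, uint32_t color, size_t n)
{
    /* Fourth byte repeats the first, i.e. the start of the next pixel. */
    const uint8_t px[4] = {
        (uint8_t)(color & 255),
        (uint8_t)((color >> 8) & 255),
        (uint8_t)((color >> 16) & 255),
        (uint8_t)(color & 255)
    };

    if (n == 0)
        return;

    /* All but the last pixel: 4-byte writes that overlap by one byte. */
    for (size_t i = 0; i + 1 < n; i++)
        memcpy(dst + 3 * i, px, 4);

    /* Last pixel: only 3 bytes, so nothing past the row is touched. */
    memcpy(dst + 3 * (n - 1), px, 3);
}
```

The fixed-size memcpy calls are normally compiled to plain stores, and building the byte pattern in an array keeps the sketch independent of host endianness.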

See http://agner.org/optimize/ for more about optimizing C and asm. That link and others can be found at https://stackoverflow.com/tags/x86/info.

Answer 2

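Found the problem!

It was in a lot of places, but the major one was vidmem. I assumed it would pass the address, but it was causing an error. After referring to it as a dword, it worked perfectly. I also had to change the other constraints to "m", and I finally got this result (after some optimization):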

__asm(
    "movl %0, %%edi;"
    "movl %k1, %%ebx;" 
    "addl %%ebx, %%edi;"
    "movl %2, %%eax;"
    "stosb;"
    "shrl $8, %%eax;"
    "stosw;" : :
    "m"(loc), "r"(vidmem), "m"(color)
    : "edi", "ebx", "eax"
);

Thanks to everyone who answered in the comments!

Inline Assembly Causing Errors about No Prefixes
