Question

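I'm a beginner in assembly and I'm stuck on an exercise: I have to implement an image filter for a BMP-encoded image in assembly. Each pixel is encoded on 24 bits, so each component of a pixel (blue, green, red) is encoded on 8 bits. To represent the image, I have an array of uint8_t. For example, the first pixel of the array is represented by array[0] (blue component), array[1] (green component) and array[2] (red component). What I have to do is implement a filter that finds the mean of the 3 components of each pixel and sets each component to that value. My problem is computing that mean. The signature of the function is extern size_t filter(const uint8_t* in_buf, uint32_t w, uint32_t h); Here is my code.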

.text
.global filter

filter:
    push %ebp
    mov %esp, %ebp
    xor %ecx, %ecx #increment variable for_width = 0
    for_width:
            xor %ebx, %ebx #increment variable for_height = 0
            for_height:
                    calcul_offset: # find the position of the position of an i, j pixel
                            mov %ecx, %esi
                            mov %ebx, %edx
                            imull $3, %esi
                            imull $3, %edx
                            imull 12(%ebp), %edx #12(%ebp) = width of the image
                            add %edx, %esi
                    calcul_mean:
                            mov %esi, %edx
                            add 8(%esi), %edx
                            add 16(%esi), %edx #edx contains the sum of the 3 components
                    change_pixel: # supposing edx contains the mean of the 3 components
                            mov 8(%ebp), %esi #8(%ebp) is the adress of the array
                            mov %edx, (%esi) #blue component
                            mov %edx, 8(%esi) #green component
                            mov %edx, 16(%esi) #red component
                            #here is my problem - to divide %esi by 3
                    inc %ebx
                    cmp 16(%ebp), %ebx #16(%ebp) = height of the image
                    jle for_height
            inc %ecx
            cmp 12(%ebp), %ecx
            jle for_width
    # retour
    ret

Note that my code isn't finished yet; for now I just want to figure out the division.

Thanks! Alex

Answer 1

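Since you asked for feedback on your code: instead of multiplying the loop counters by things on every iteration, use add to step a pointer through the array. This is called strength reduction, because add is a cheaper operation than imul.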
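To illustrate the idea in C rather than assembly, here is a minimal sketch. The function names are made up for this example, and it assumes rows are packed with no padding, as the question's offset calculation does. Both functions read the blue byte of every pixel; the first recomputes the offset with multiplies, the second just adds 3 to a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Indexed version: recomputes the byte offset with multiplies for every
   pixel, much like the loop in the question. (Hypothetical helper.) */
uint32_t sum_blue_indexed(const uint8_t *buf, uint32_t w, uint32_t h)
{
    assert(buf != NULL);
    uint32_t total = 0;
    for (uint32_t y = 0; y < h; y++)
        for (uint32_t x = 0; x < w; x++)
            total += buf[3 * ((size_t)y * w + x)];  /* multiplies per pixel */
    return total;
}

/* Strength-reduced version: a single pointer advanced by 3 bytes per pixel.
   (Hypothetical helper; assumes no row padding.) */
uint32_t sum_blue_pointer(const uint8_t *buf, uint32_t w, uint32_t h)
{
    assert(buf != NULL);
    const uint8_t *p = buf;
    const uint8_t *end = buf + 3 * (size_t)w * h;   /* one past the last pixel */
    uint32_t total = 0;
    for (; p != end; p += 3)                        /* only an add per pixel */
        total += p[0];
    return total;
}
```

Because the filter treats every pixel independently, the order in which pixels are visited does not matter, so a single flat loop over the buffer can replace the two nested loops.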

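Even without strength reduction, those mov-and-imul sequences are really silly. imul-immediate is a 3-operand instruction with a write-only destination. imul $3, %esi is just shorthand for imul $3, %esi, %esi. So obviously you can drop the mov %ecx, %esi and write imul $3, %ecx, %esi instead. You can also use LEA to multiply by small constants. It's more obvious in Intel syntax (lea esi, [ecx + ecx*2]), but lea (%ecx, %ecx, 2), %esi does the trick.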

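You could also take advantage of two-register addressing modes instead of doing a lot of work to get the final address into %esi. However, it might be better to stick to one-register addressing modes for perf reasons on Intel SnB-family CPUs.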

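I thought you said your values were 1-byte uint8_t color components? add 8(%esi), %edx has a 4B memory source operand. I haven't read your code carefully, but if you want to handle the color components separately, you need add 8(%esi), %dl and similar.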
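As a rough C picture of the operand-size issue (the helper names are invented for this sketch): the three components sit at byte offsets 0, 1 and 2, as the question describes, and each byte has to be widened before adding, because a sum kept in 8 bits wraps around.

```c
#include <assert.h>
#include <stdint.h>

/* Mean of one BGR pixel; p points at the blue byte. (Hypothetical helper.) */
uint8_t pixel_mean(const uint8_t *p)
{
    assert(p != 0);
    /* Each byte is zero-extended to 32 bits before the addition (in asm
       this would be a movzbl load per component), so the sum cannot wrap. */
    uint32_t sum = (uint32_t)p[0] + p[1] + p[2];
    return (uint8_t)(sum / 3);
}

/* What a purely 8-bit accumulation (add ..., %dl) leaves behind: the sum
   truncated modulo 256. (Hypothetical helper, shown for contrast.) */
uint8_t wrapped_sum8(const uint8_t *p)
{
    assert(p != 0);
    return (uint8_t)(p[0] + p[1] + p[2]);
}
```

For the pixel {200, 100, 60} the true sum is 360 and the mean is 120, while the 8-bit sum wraps to 104.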

This is an ideal candidate for SIMD vectorization.

div is very slow. It is probably faster to replace the actual division with a multiply and shift, like any good compiler does. clang-3.8 output (from Godbolt):

unsigned div3_32b(unsigned a) { return a/3; }
        movl    %edi, %ecx
        movl    $2863311531, %eax       # imm = 0xAAAAAAAB  
        imulq   %rcx, %rax
        shrq    $33, %rax
        retq

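gcc uses a 32-bit one-operand mul instead of a 64b imul, so have a look at its output on Godbolt if that matters to you; otherwise the 64b imul will be faster on most CPUs, IIRC. (See the x86 tag wiki for optimization links.)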
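The same trick written out in C may make the constant easier to follow. This is only a sketch with made-up function names: 0xAAAAAAAB is (2^33 + 1) / 3, so multiplying by it and shifting the 64-bit product right by 33 gives the same result as a / 3 for every 32-bit unsigned a.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned divide-by-3 using a multiply and a shift, mirroring the clang
   output above. (Hypothetical name.) */
uint32_t div3_mulshift(uint32_t a)
{
    uint64_t product = (uint64_t)a * 0xAAAAAAABu;  /* full 64-bit product */
    return (uint32_t)(product >> 33);
}

/* Spot-check against the ordinary division operator. (Hypothetical name.) */
void div3_selfcheck(void)
{
    const uint32_t samples[] = {0u, 1u, 2u, 3u, 765u, 0x7FFFFFFFu, 0xFFFFFFFFu};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(div3_mulshift(samples[i]) == samples[i] / 3);
}
```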

unsigned char div3_8b(unsigned char a) { return a/3; }
        imull   $171, %edi, %eax
        andl    $65024, %eax            # imm = 0xFE00
        shrl    $9, %eax
        retq

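The 8b version looks like a clang bug: it seems to assume that the upper bits of %edi are zeroed, so that the upper bits of the 8x8 -> 16b multiply result are correct. The 64-bit ABI doesn't guarantee that.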

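But this is not a problem when inlining. In practice, you need a version that works for numbers of at least 9 bits held in a 32b register, because the sum of three 8b numbers can overflow 8b (the largest possible sum, 3 × 255 = 765, needs 10 bits).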
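A possible version for that wider range, sketched in C with an invented function name: the 171 / shift-by-9 pair from the 8b code is only exact for inputs below 512 (for example, 512 × 171 >> 9 gives 171 rather than 170), so this sketch swaps in the constant 0xAAAB with a shift of 17, which is exact for any input below 2^17 and still fits a 32-bit multiply for sums up to 765.

```c
#include <assert.h>
#include <stdint.h>

/* Mean of three 8-bit components without a div instruction.
   (Hypothetical helper; 0xAAAB = (2^17 + 1) / 3.) */
uint8_t mean3(uint8_t b, uint8_t g, uint8_t r)
{
    uint32_t sum = (uint32_t)b + g + r;     /* 0..765, needs up to 10 bits */
    assert(sum <= 765);
    /* 765 * 0xAAAB is well below 2^32, so the product cannot overflow. */
    return (uint8_t)((sum * 0xAAABu) >> 17);
}
```

In 32-bit assembly this maps to a zero-extending load per component, two adds, one imul with an immediate and one shr, with no need for the EDX:EAX register pair.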

Answer 2

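The x86 family of processors has a DIV instruction that does the division for you. The DIV instruction uses specific input and output registers, depending on the operand size. From the programmer's guide: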

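Divides unsigned the value in the AX, DX:AX, EDX:EAX, or RDX:RAX registers (dividend) by the source operand (divisor) and stores the result in the AX (AH:AL), DX:AX, EDX:EAX, or RDX:RAX registers. The source operand can be a general-purpose register or a memory location. This instruction's action depends on the operand size (dividend/divisor).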

So to divide ESI by 3, use this code:

mov %esi, %eax
xor %edx, %edx
mov $3, %ecx
div %ecx         
# the result is now in %eax
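To see why %edx is zeroed first, here is a small C model of what a 32-bit div does, written only as an illustration (model_div32 is a made-up name, not a real API). The dividend is the 64-bit value EDX:EAX, the quotient goes to EAX and the remainder to EDX; the CPU raises a divide error if the divisor is zero or the quotient does not fit in 32 bits, which the model reports by returning 0.

```c
#include <assert.h>
#include <stdint.h>

/* Models 32-bit `div src`. Returns 1 on success, 0 where the CPU would
   raise a divide error. (Hypothetical helper for illustration only.) */
int model_div32(uint32_t edx, uint32_t eax, uint32_t src,
                uint32_t *quotient, uint32_t *remainder)
{
    assert(quotient != 0 && remainder != 0);
    if (src == 0)
        return 0;                                    /* division by zero */
    uint64_t dividend = ((uint64_t)edx << 32) | eax; /* EDX:EAX */
    uint64_t q = dividend / src;
    if (q > UINT32_MAX)
        return 0;                                    /* quotient too big for EAX */
    *quotient = (uint32_t)q;                         /* ends up in EAX */
    *remainder = (uint32_t)(dividend % src);         /* ends up in EDX */
    return 1;
}
```

If stale data were left in EDX, it would become the high half of the dividend, giving a wrong quotient or a fault.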

Average of 3 numbers in assembly
