Question

我在优化Eratosthenes筛选器的代码方面取得了一些进展，但我需要进一步改进指令数量和数据缓存命中率。任何帮助表示赞赏。

    .data           # the data segment to store global data
space:  .asciiz " "     # whitespace to separate prime numbers

    .text           # the text segment to store instructions
    .globl  main        # define main to be a global label
main:   li  $s0, 0x00000000 # initialize $s0 with zeros
    nor $s1, $s0, $s0   # saves one ALU count over using li $s1, 0x11111111
    li  $t9, 200    # find prime numbers from 2 to $t9

    add $s2, $sp, 0 # backup bottom of stack address in $s2

    li  $t0, 2      # set counter variable to 2

init:   sw  $s1, ($sp)  # write ones to the stackpointer's address
    add $t0, $t0, 1 # increment counter variable
    sub $sp, $sp, 4 # subtract 4 bytes from stackpointer (push)
    bne     $t0, $t9, init  # take loop if $t0 != $t9, changed ble to bne
    addi    $t8, $t0, 15    # approximate square root of 200
    li  $t0, 1      # reset counter variable to 1

outer:  add     $t0, $t0, 2 # increment counter variable (start at 2)
    mul $t1, $t0, $t0   # squaring $t0 and save it to $t1
    bgt $t1, $t8, print # start printing prime numbers when $t1 > $t8, changed so only bgt if $t1 > square root

check:  add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t0, 2 # calculate the number of bytes to jump over
    sub $t2, $t2, $t3   # subtract them from bottom of stack address
    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    lw  $t3, ($t2)  # load the content into $t3

    beq $t3, $s0, outer # only 0's? go back to the outer loop

inner:  add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t1, 2 # calculate the number of bytes to jump over

    add     $t4, $t1, 2 # save $t1 + 2 into $t4, added
    sll     $t5, $t4, 2 # mul by 4, added

    sub $t2, $t2, $t3   # subtract them from bottom of stack address

    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    sw  $s0, ($t2)  # store 0's -> it's not a prime number!

    add $t1, $t1, $t0   # do this for every multiple of $t0

    add $t1, $t1, $t0   # adding $t0 to $t1, added
    sub $t2, $t2, $t5   # save $t2 - $t5 into $t2, added
    add $t2, $t2, 8 # adding 8 to $t2, added
    sw  $s0, ($t2)  # store contents of $s0 at address contained in $t2, added
    add $t4, $t4, $t0   # adding $t0 to $t4, added

    blt $t1, $t9, inner # every multiple done? go back to outer loop, changed to blt and branching to inner

    j   outer       # some multiples left? go back to inner loop, changed to branching to outer

print:  li  $t0, 1      # reset counter variable to 1

    # hard coding a 2
    li  $v0, 1
    addi    $a0, $a0, 2
    syscall

    # hard coding a space
    li  $v0, 4
    la  $a0, space
    syscall

count:  add $t0, $t0, 2 # increment counter variable (start at 2), skipping even numbers

    bgt $t0, $t9, exit  # make sure to exit when all numbers are done (branch to exit if $t0 > $t9)

    add $t2, $s2, 0 # save the bottom of stack address to $t2
    sll $t3, $t0, 2 # calculate the number of bytes to jump over
    sub $t2, $t2, $t3   # subtract them from bottom of stack address
    add $t2, $t2, 8 # add 2 words - we started counting at 2!

    lw  $t3, ($t2)  # load the content into $t3
    beq $t3, $s0, count # only 0's? go back to count loop

    add $t3, $s2, 0 # save the bottom of stack address to $t3

    sub $t3, $t3, $t2   # substract higher from lower address (= bytes)
    srl $t3, $t3, 2 # changed div to srl
    add $t3, $t3, 2 # add 2 (words) = the final prime number!

    li  $v0, 1      # system code to print integer
    add $a0, $t3, 0 # the argument will be our prime number in $t3
    syscall         # print it!

    li  $v0, 4      # system code to print string
    la  $a0, space  # the argument will be a whitespace
    syscall         # print it!

    bne $t0, $t9, count # take loop when $t0 != $t9, changed ble to bne

exit:   li  $v0, 10     # set up system call 10 (exit)
    syscall

Answer 1

第一个优化是在C / C ++中编写算法代码并应用尽职调查（清理/收紧代码）。如果这还不够快，请开始对Eratosthenes的Sieve进行常规优化：

使用打包位数组而不是按字节或单词表示数字
使用仅有几率的筛子（在需要时用空气拉出素数2）
筛选小的，缓存友好的段（例如，许多CPU上的32KB L1缓存）
记住段之间的偏移以避免代价高昂的模数划分
使用与少数小素数的筛选相对应的位模式初始化片段
调度段以并行筛选所有可用核心

Sieve of Eratosthenes - segmented to increase speed and range显示了分段筛的清洁和精益实施，并解释了这些优化（多线程除外）;它还包含可编译测试程序的链接，用于演示这些优化。

除非C编译器异常差，否则汇编编码可以将筛网的简单C实现的速度提高大约10％。在实际数字中，这意味着C版本可能需要15秒来筛选最多2 ^ 32的数字，而汇编程序版本可能需要13秒。应用上面列表中的前五个优化将C版本降低到大约2秒，添加最后一个可以使您达到不到一秒（如果您有8个空闲核心，则大约为0.3）。

如果这还不够快，那么汇编编码可能没有任何帮助，因为现在优化更多的是减少争用，避免深存储器缓存层次结构的延迟，并保持流水线充足的指令准备好执行（而不是等待依赖）。即，优化战争在算法和架构层面上获胜，而不是在装配微观优化方面。当前的编译器倾向于非常擅长自动应用微优化，如指令调度，避免分支错误预测处罚等.pp。当手动编码汇编需要非常努力甚至达到编译代码的速度时，更不用说超越它了

下一级优化将扩展仅赔率筛的想法，以排除更多的小素数（3,5,7等）。这通常被称为“轮子”。但是，添加更多辐条的收益越来越小。双辐轮（仅有几率的筛子）已经将工作量削减了一半，但是在图片中添加3只能再去除三分之一，依此类推，同时使代码变得更加复杂。复杂性可能使代码比仅有赔率的版本慢，所以你需要使用更高阶的轮子才能看到显着的改进。

在任何情况下，无论如何，优化都需要以高级语言进行原型化和参考化实现，这是人类可读的，例如： C / C ++并不需要MIPS的大祭司解释。

优化MIPS中Eratosthenes的筛选

1 个答案: