我在优化Eratosthenes筛选器的代码方面取得了一些进展,但我需要进一步改进指令数量和数据缓存命中率。任何帮助表示赞赏。
.data # the data segment to store global data
space: .asciiz " " # whitespace to separate prime numbers
.text # the text segment to store instructions
.globl main # define main to be a global label
main: li $s0, 0x00000000 # initialize $s0 with zeros
nor $s1, $s0, $s0 # saves one ALU count over using li $s1, 0x11111111
li $t9, 200 # find prime numbers from 2 to $t9
add $s2, $sp, 0 # backup bottom of stack address in $s2
li $t0, 2 # set counter variable to 2
init: sw $s1, ($sp) # write ones to the stackpointer's address
add $t0, $t0, 1 # increment counter variable
sub $sp, $sp, 4 # subtract 4 bytes from stackpointer (push)
bne $t0, $t9, init # take loop if $t0 != $t9, changed ble to bne
addi $t8, $t0, 15 # approximate square root of 200
li $t0, 1 # reset counter variable to 1
outer: add $t0, $t0, 2 # increment counter variable (start at 2)
mul $t1, $t0, $t0 # squaring $t0 and save it to $t1
bgt $t1, $t8, print # start printing prime numbers when $t1 > $t8, changed so only bgt if $t1 > square root
check: add $t2, $s2, 0 # save the bottom of stack address to $t2
sll $t3, $t0, 2 # calculate the number of bytes to jump over
sub $t2, $t2, $t3 # subtract them from bottom of stack address
add $t2, $t2, 8 # add 2 words - we started counting at 2!
lw $t3, ($t2) # load the content into $t3
beq $t3, $s0, outer # only 0's? go back to the outer loop
inner: add $t2, $s2, 0 # save the bottom of stack address to $t2
sll $t3, $t1, 2 # calculate the number of bytes to jump over
add $t4, $t1, 2 # save $t1 + 2 into $t4, added
sll $t5, $t4, 2 # mul by 4, added
sub $t2, $t2, $t3 # subtract them from bottom of stack address
add $t2, $t2, 8 # add 2 words - we started counting at 2!
sw $s0, ($t2) # store 0's -> it's not a prime number!
add $t1, $t1, $t0 # do this for every multiple of $t0
add $t1, $t1, $t0 # adding $t0 to $t1, added
sub $t2, $t2, $t5 # save $t2 - $t5 into $t2, added
add $t2, $t2, 8 # adding 8 to $t2, added
sw $s0, ($t2) # store contents of $s0 at address contained in $t2, added
add $t4, $t4, $t0 # adding $t0 to $t4, added
blt $t1, $t9, inner # every multiple done? go back to outer loop, changed to blt and branching to inner
j outer # some multiples left? go back to inner loop, changed to branching to outer
print: li $t0, 1 # reset counter variable to 1
# hard coding a 2
li $v0, 1
addi $a0, $a0, 2
syscall
# hard coding a space
li $v0, 4
la $a0, space
syscall
count: add $t0, $t0, 2 # increment counter variable (start at 2), skipping even numbers
bgt $t0, $t9, exit # make sure to exit when all numbers are done (branch to exit if $t0 > $t9)
add $t2, $s2, 0 # save the bottom of stack address to $t2
sll $t3, $t0, 2 # calculate the number of bytes to jump over
sub $t2, $t2, $t3 # subtract them from bottom of stack address
add $t2, $t2, 8 # add 2 words - we started counting at 2!
lw $t3, ($t2) # load the content into $t3
beq $t3, $s0, count # only 0's? go back to count loop
add $t3, $s2, 0 # save the bottom of stack address to $t3
sub $t3, $t3, $t2 # substract higher from lower address (= bytes)
srl $t3, $t3, 2 # changed div to srl
add $t3, $t3, 2 # add 2 (words) = the final prime number!
li $v0, 1 # system code to print integer
add $a0, $t3, 0 # the argument will be our prime number in $t3
syscall # print it!
li $v0, 4 # system code to print string
la $a0, space # the argument will be a whitespace
syscall # print it!
bne $t0, $t9, count # take loop when $t0 != $t9, changed ble to bne
exit: li $v0, 10 # set up system call 10 (exit)
syscall
答案 0 :(得分:0)
第一个优化是在C / C ++中编写算法代码并应用尽职调查(清理/收紧代码)。如果这还不够快,请开始对Eratosthenes的Sieve进行常规优化:
Sieve of Eratosthenes - segmented to increase speed and range显示了分段筛的清洁和精益实施,并解释了这些优化(多线程除外);它还包含可编译测试程序的链接,用于演示这些优化。
除非C编译器异常差,否则汇编编码可以将筛网的简单C实现的速度提高大约10%。在实际数字中,这意味着C版本可能需要15秒来筛选最多2 ^ 32的数字,而汇编程序版本可能需要13秒。应用上面列表中的前五个优化将C版本降低到大约2秒,添加最后一个可以使您达到不到一秒(如果您有8个空闲核心,则大约为0.3)。
如果这还不够快,那么汇编编码可能没有任何帮助,因为现在优化更多的是减少争用,避免深存储器缓存层次结构的延迟,并保持流水线充足的指令准备好执行(而不是等待依赖)。即,优化战争在算法和架构层面上获胜,而不是在装配微观优化方面。当前的编译器倾向于非常擅长自动应用微优化,如指令调度,避免分支错误预测处罚等.pp。当手动编码汇编需要非常努力甚至达到编译代码的速度时,更不用说超越它了
下一级优化将扩展仅赔率筛的想法,以排除更多的小素数(3,5,7等)。这通常被称为“轮子”。但是,添加更多辐条的收益越来越小。双辐轮(仅有几率的筛子)已经将工作量削减了一半,但是在图片中添加3只能再去除三分之一,依此类推,同时使代码变得更加复杂。复杂性可能使代码比仅有赔率的版本慢,所以你需要使用更高阶的轮子才能看到显着的改进。
在任何情况下,无论如何,优化都需要以高级语言进行原型化和参考化实现,这是人类可读的,例如: C / C ++并不需要MIPS的大祭司解释。