Question

在这个link中，作者给出了一个例子

[:foo, :bar].sample # => :foo, or :bar :-)

并且作者将此代码评论为

另一种方法，可以进行许多优化（循环展开，CPU流水线操作，花费更少的时间来评估条件）将涉及在块中添加tt（例如，大小的块） 128）并在每个块之后检查条件。当它条件满足，最后一个块可以重复确定 ic的价值。

这是什么意思？循环展开？ CPU流水线？在块中添加subroutine threshold(a, thresh, ic) real, dimension(:), intent(in) :: a real, intent(in) :: thresh integer, intent(out) :: ic real :: tt integer :: n ic = 0 tt = 0.d0 n = size(a) do j = 1, n tt = tt + a(j) * a(j) if (sqrt(tt) >= thresh) then ic = j return end if end do end subroutine threshold

如作者所说，如何优化代码？

Answer 1

如果循环在适合CPU缓存的块/块中执行，则将减少缓存未命中数，从而减少从内存中检索的缓存行数。这样可以提高受内存操作限制的所有循环的性能。如果相应的块大小为BLOCKSIZE，则通过

实现

do j = 1, n, BLOCKSIZE
   do jj = j, j+BLOCKSIZE-1 
      tt = tt + a(jj) * a(jj)
   end do
end do

然而，这将留下未在主循环中处理的余数。为了说明这一点，请考虑一个长度为1000的数组。前七个块（1--896）在循环中被覆盖，但第八个块（897--1024）不是。因此，需要另一个剩余循环：

do j=(n/BLOCKSIZE)*BLOCKSIZE,n
   ! ...
enddo

虽然从余数循环中删除条件没有意义，但它可以在被阻塞的主循环的外循环中执行。由于现在内循环中没有分支，因此积极优化可能适用然而，这限制了确定的位置对块的“准确性”。要获得元素精确度，您必须重复计算。

以下是完整的代码：

subroutine threshold_block(a, thresh, ic)
  implicit none
  real, dimension(:), intent(in) :: a
  real, intent(in) :: thresh
  integer, intent(out) :: ic
  real :: tt, tt_bak, thresh_sqr
  integer :: n, j, jj
  integer,parameter :: BLOCKSIZE = 128

  ic = 0
  tt = 0.d0
  thresh_sqr = thresh**2
  n = size(a)
  ! Perform the loop in chunks of BLOCKSIZE
  do j = 1, n, BLOCKSIZE
     tt_bak = tt
     do jj = j, j+BLOCKSIZE-1 
        tt = tt + a(jj) * a(jj)
     end do
     ! Perform the check on the block level
     if (tt >= thresh_sqr) then
        ! If the threshold is reached, repeat the last block 
        ! to determine the last position
        tt = tt_bak
        do jj = j, j+BLOCKSIZE-1 
           tt = tt + a(jj) * a(jj)
           if (tt >= thresh_sqr) then
              ic = jj
              return
           end if
        end do
     end if
  end do

  ! Remainder is treated element-wise
  do j=(n/BLOCKSIZE)*BLOCKSIZE,n
     tt = tt + a(j) * a(j)
     if (tt >= thresh_sqr) then
        ic = j
        return
     end if
  end do 
end subroutine threshold_block

请注意，编译器现在非常善于与其他优化一起创建阻塞循环。根据我的经验，通过手动调整它，很难从这些简单的循环中获得更好的性能使用编译器选项gfortran在-floop-block中启用循环阻塞。

循环展开可以手动完成，但应留给编译器。我们的想法是在块中手动执行循环而不是如上所示的第二个循环，通过复制代码来执行操作。以下是上面给出的内循环的示例，用于循环展开因子4：

do jj = j, j+BLOCKSIZE-1,4
   tt = tt + a(jj)   * a(jj)
   tt = tt + a(jj+1) * a(jj+1)
   tt = tt + a(jj+2) * a(jj+2)
   tt = tt + a(jj+3) * a(jj+3)
end do

此处，如果BLOCKSIZE是4的倍数，则不会出现余数。你可以在这里刮掉一些操作;-) 启用此功能的gfortran编译器选项为-funroll-loops

据我所知，无法在Fortran中手动强制执行 CPU Pipelining （Instruction Pipelining）。此任务取决于编译器。

Pipelining设置了一系列指令。您将完整的阵列送入该管道，在收尾阶段之后，您将获得每个时钟周期的结果。这大大增加了吞吐量。然而，分支很难（不可能？）在管道中处理，并且阵列应该足够长，以便补偿设置管道，卷绕和放松阶段所需的时间。

如何避免循环中的条件

1 个答案: