Question

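I am working on a code that contains a loop with many iterations (~10^6-10^7), in which an array (say, 'myresult') is computed by summing a large number of contributions. In Fortran 90 with OpenMP, this looks something like: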

!$omp parallel do reduction(+:myresult)
do i=1,N
 ! [contribution] is a placeholder for the per-iteration term
 myresult(i) = myresult(i) + [contribution]
enddo
!$omp end parallel do

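The code will run on a system with Intel Xeon Phi coprocessors, and I would of course like to benefit from them if possible. I have tried using the MIC offload directives (!dir$ offload target ...) together with OpenMP so that the loop runs only on the coprocessor, but then I waste host CPU time while the host sits waiting for the coprocessor to finish. Ideally, one could split the loop between the host and the device, so I would like to know whether something like the following is feasible (or whether there is a better approach). Here the host part of the loop would run on only one core (though perhaps with OMP_NUM_THREADS=2?):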

!$omp parallel sections reduction(+:myresult)

!$omp section ! parallel calculation on device
!dir$ omp offload target(mic)
!$omp parallel do reduction(+:myresult)
(do i=N/2+1,N)
!$omp end parallel do

!$omp section ! serial calculation on host
(do i=1,N/2)

!$omp end parallel sections

Answer 1

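Have you considered using MPI symmetric mode instead of offload? If not, MPI can do what you just described: you start two MPI ranks, one on the host and one on the coprocessor. Each rank then executes its share of the loop in parallel using OpenMP.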
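A minimal sketch of what such a hybrid MPI + OpenMP program might look like, assuming a scalar sum and an even split of the iterations between ranks. The program name, the problem size n, the constant contribution 55 and the even split are all illustrative assumptions, not part of the question's code:

```fortran
! Hypothetical sketch: each MPI rank sums its own block of iterations
! with OpenMP, and the partial sums are combined with MPI_Reduce.
program symmetric_sum
  use mpi
  implicit none
  integer, parameter :: n = 10000      ! assumed problem size
  integer :: ierr, rank, nranks, i, lo, hi
  integer :: partial, total
  integer :: myresult(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  myresult = 0

  ! Contiguous block of iterations for this rank (even split;
  ! the last rank also takes any remainder).
  lo = rank * (n / nranks) + 1
  hi = (rank + 1) * (n / nranks)
  if (rank == nranks - 1) hi = n

  ! Thread-parallel partial sum within the rank.
  partial = 0
  !$omp parallel do reduction(+:partial)
  do i = lo, hi
    partial = partial + myresult(i) + 55   ! 55 stands in for the real contribution
  end do
  !$omp end parallel do

  ! Combine the per-rank partial sums on rank 0.
  call MPI_Reduce(partial, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'total =', total

  call MPI_Finalize(ierr)
end program symmetric_sum
```

Because the host and the coprocessor are unlikely to run at the same speed, an even split is probably not optimal; the bounds lo and hi would need to be weighted by measured throughput. How the two ranks are placed on the host and the coprocessor depends on the MPI installation and is not shown here.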

Answer 2

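The general idea is to use an asynchronous offload to the MIC so that the CPU can continue working in the meantime. Leaving aside the details of how to divide the work, this is how it can be expressed: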

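module m
!dir$ attributes offload:mic :: myresult, micresult
integer :: myresult(10000)
integer :: result
integer :: micresult
end module

program main
use m
implicit none
integer :: i, N
N = 10000
result = 0
micresult = 0
myresult = 0
! Start the second half on the coprocessor asynchronously; the host continues immediately.
!dir$ omp offload target(mic:0) signal(micresult)
!$omp parallel do reduction(+:micresult)
do i=N/2+1,N
 micresult = micresult + myresult(i) + 55
enddo
!$omp end parallel do

! Meanwhile, the host processes the first half.
!$omp parallel do reduction(+:result)
do i=1,N/2
 result = result + myresult(i) + 55
enddo
!$omp end parallel do

! Block until the offload has finished, then combine the partial sums.
!dir$ offload_wait target(mic:0) wait(micresult)
result = result + micresult
end program main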

Heterogeneous OpenMP parallel loop with Intel MIC offload
