我知道有时也会问类似的问题:Openmp array reductions with Fortran,Reducing on array in OpenMP,甚至在英特尔论坛(https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415)中,但我想知道您的意见,因为可扩展性我得到的不是我期望的。
因此,我需要填充非常大的复数数组,我想将其与OpenMP并行化。我们的第一种方法是:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index
ALLOCATE(huge_array(huge_number))
!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array)
ALLOCATE(thread_huge_array(SIZE(huge_array)))
thread_huge_array = ZERO
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,6
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index) = thread_huge_array(depending_index)
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP BARRIER
!$OMP MASTER
huge_array = ZERO
!$OMP END MASTER
!$OMP CRITICAL
huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL
因此,通过这种方法,我们可以获得良好的可扩展性,直到8个内核,合理的可扩展性直到32个内核,并从40个内核开始,这要比16个内核(我们的机器具有80个物理内核)要慢。当然,我们不能使用REDUCTION子句,因为数组的大小太大,以致不能容纳在堆栈中(甚至将ulimit增加到机器允许的最大值)。
我们为此尝试了另一种方法:
COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), POINTER:: thread_huge_array(:)
INTEGER :: huge_number
ALLOCATE(huge_array(huge_number))
ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
thread_huge_array = ZERO
!$OMP PARALLEL PRIVATE (num_thread)
num_thread = omp_get_thread_num()+1
!$OMP DO
DO index1=1,bignumber1
! Some calculations
DO index2=1,bignumber2
! Some calculations
DO index3=1,6
DO index4=1,num_weights_sp
DO index5=1,smallnumber
depending_index = function(index1, index2, index3, index4, index5)
thread_huge_array(depending_index, omp_get_thread_num()) = thread_huge_array(depending_index, omp_get_thread_num())
ENDDO
ENDDO
ENDDO
ENDDO
ENDDO
!$OMP END DO
!$OMP END PARALLEL
huge_array = ZERO
DO index_ii = 1,omp_get_max_threads()
huge_array = huge_array + thread_huge_array(:,index_ii)
ENDDO
DEALLOCATE(thread_huge_array)
DEALLOCATE(huge_array)
在最后一种情况下,由于该方法分配了更长的时间(由于分配了更大的内存),并且相对加速度较差。
您能否提供一些提示以实现更好的加速?还是用OpenMP使用这些庞大的阵列是不可能的?