Reducing on large arrays with OpenMP in Fortran

Time: 2019-05-25 11:36:17

Tags: arrays fortran openmp fortran90 reduction

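I know that similar questions have been asked before ("Openmp array reductions with Fortran", "Reducing on array in OpenMP", and even on the Intel forums: https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/345415), but I would like to know your opinion, because the scalability I am getting is not what I expect.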

I need to fill a very large array of complex numbers, and I would like to parallelize this with OpenMP. Our first approach is:

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), ALLOCATABLE :: thread_huge_array(:)
INTEGER :: huge_number, index1, index2, index3, index4, index5, bignumber1, bignumber2, smallnumber, depending_index

ALLOCATE(huge_array(huge_number))

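! depending_index is written by every thread, so it must be private (shared, it is a data race)
!$OMP PARALLEL FIRSTPRIVATE(thread_huge_array) PRIVATE(depending_index)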
      ALLOCATE(thread_huge_array(SIZE(huge_array)))
      thread_huge_array = ZERO
!$OMP DO
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,6
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     thread_huge_array(depending_index) = thread_huge_array(depending_index)
                  ENDDO
               ENDDO 
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
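! The END DO above already implies a barrier. SINGLE (unlike MASTER) ends with an
! implicit barrier, so no thread can start accumulating before huge_array is zeroed.
!$OMP SINGLE
      huge_array = ZERO
!$OMP END SINGLE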
!$OMP CRITICAL
      huge_array = huge_array + thread_huge_array
!$OMP END CRITICAL
      DEALLOCATE(thread_huge_array)
!$OMP END PARALLEL

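With this approach we get good scalability up to 8 cores and reasonable scalability up to 32 cores, but from 40 cores on it is slower than with 16 cores (our machine has 80 physical cores). Of course, we cannot use the REDUCTION clause, because the array is too large to fit on the stack (even after raising ulimit to the maximum the machine allows).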
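One thing that may be worth checking against the REDUCTION remark above: since OpenMP 3.0, allocatable arrays are allowed in a REDUCTION clause. Where the private copies are placed (heap or thread stack) is compiler-dependent, so whether this avoids the stack limit on a given compiler is an assumption to verify, not a guarantee. The sketch below is only illustrative: `fill_with_reduction`, the unit contribution, and the `mod`-based index are hypothetical stand-ins for the real kernel and the index function in the post.

```fortran
module reduction_sketch
   implicit none
   integer, parameter :: dbl = kind(1.0d0)
contains
   ! Accumulate niter unit contributions into acc with an OpenMP array reduction.
   ! acc must already be allocated by the caller.
   subroutine fill_with_reduction(acc, niter)
      complex(kind=dbl), allocatable, intent(inout) :: acc(:)
      integer, intent(in) :: niter
      integer :: i, idx, n

      n = size(acc)
      acc = (0.0_dbl, 0.0_dbl)

      ! Each thread works on its own private copy of acc;
      ! the copies are summed into acc at the end of the loop.
      !$omp parallel do private(idx) reduction(+:acc)
      do i = 1, niter
         idx = mod(i - 1, n) + 1   ! hypothetical stand-in for the index function
         acc(idx) = acc(idx) + (1.0_dbl, 0.0_dbl)
      end do
      !$omp end parallel do
   end subroutine fill_with_reduction
end module reduction_sketch
```

The stack size of the worker threads is set through the OMP_STACKSIZE environment variable, which is separate from the ulimit setting. Even if the reduction compiles and runs, combining the private copies still costs work proportional to the array size times the number of threads, so this alone is unlikely to remove the scaling limit.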

We tried a different approach for this:

COMPLEX(KIND=DBL), ALLOCATABLE :: huge_array(:)
COMPLEX(KIND=DBL), POINTER :: thread_huge_array(:,:)   ! rank 2: one column per thread
INTEGER :: huge_number, num_thread, index_ii

ALLOCATE(huge_array(huge_number))

ALLOCATE(thread_huge_array(SIZE(huge_array),omp_get_max_threads()))
thread_huge_array = ZERO

!$OMP PARALLEL PRIVATE(num_thread, depending_index)

      num_thread = omp_get_thread_num()+1
!$OMP DO
      DO index1=1,bignumber1
         ! Some calculations
         DO index2=1,bignumber2
            ! Some calculations
            DO index3=1,6
               DO index4=1,num_weights_sp
                  DO index5=1,smallnumber
                     depending_index = function(index1, index2, index3, index4, index5)
                     ! Use the 1-based num_thread; omp_get_thread_num() is 0-based and would index column 0
                     thread_huge_array(depending_index, num_thread) = thread_huge_array(depending_index, num_thread)
                  ENDDO
               ENDDO 
            ENDDO
         ENDDO 
      ENDDO 
!$OMP END DO
!$OMP END PARALLEL

huge_array = ZERO

DO index_ii = 1,omp_get_max_threads()
   huge_array = huge_array + thread_huge_array(:,index_ii)
ENDDO

DEALLOCATE(thread_huge_array)

DEALLOCATE(huge_array)

In this last case the method takes longer (because it allocates much more memory) and the relative speed-up is worse.
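In the second listing, both the zeroing of the per-thread array and the final merge run on a single thread. A minimal sketch of a variation is given below, assuming it is acceptable to zero each column inside the parallel region and to merge with a second work-shared loop. The names `fill_and_merge`, `per_thread`, and `nused`, the unit contribution, and the `mod`-based index are all hypothetical stand-ins. Whether thread-local zeroing helps memory placement depends on the machine (for example a first-touch NUMA policy), which is an assumption here.

```fortran
module column_merge_sketch
!$ use omp_lib
   implicit none
   integer, parameter :: dbl = kind(1.0d0)
contains
   ! Fill one column per thread, then merge the columns in parallel.
   subroutine fill_and_merge(total, niter)
      complex(kind=dbl), intent(out) :: total(:)
      integer, intent(in) :: niter
      complex(kind=dbl), allocatable :: per_thread(:,:)
      complex(kind=dbl) :: acc
      integer :: i, idx, col, n, t, nthreads, nused

      n = size(total)
      nthreads = 1
!$    nthreads = omp_get_max_threads()
      nused = 1
      allocate(per_thread(n, nthreads))

      !$omp parallel private(t, idx, col, acc)
      t = 1
!$    t = omp_get_thread_num() + 1

      ! Record how many threads actually run; only their columns get initialized.
      !$omp single
!$    nused = omp_get_num_threads()
      !$omp end single

      ! Each thread zeroes its own column.
      per_thread(:, t) = (0.0_dbl, 0.0_dbl)

      !$omp do
      do i = 1, niter
         idx = mod(i - 1, n) + 1   ! hypothetical stand-in for the index function
         per_thread(idx, t) = per_thread(idx, t) + (1.0_dbl, 0.0_dbl)
      end do
      !$omp end do
      ! The implicit barrier above guarantees every column is complete.

      ! Merge: rows are shared out among the threads; each row sums across columns.
      !$omp do
      do i = 1, n
         acc = (0.0_dbl, 0.0_dbl)
         do col = 1, nused
            acc = acc + per_thread(i, col)
         end do
         total(i) = acc
      end do
      !$omp end do
      !$omp end parallel

      deallocate(per_thread)
   end subroutine fill_and_merge
end module column_merge_sketch
```

The `!$` lines are OpenMP conditional-compilation sentinels, so the module also builds without OpenMP and then runs with a single column. The memory footprint is still the array size times the number of threads, as in the original second approach. Only the serial zeroing and merge are changed.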

Could you give some hints for achieving a better speed-up? Or is it simply not possible to work with arrays this large in OpenMP?

0 answers:

No answers yet