I have part of a Fortran program containing some nested loops that I want to parallelize with OpenMP.
integer :: nstates, N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4) :: mat
double complex, dimension(:), allocatable :: vecin, vecout

nstates = 2
N = 24
allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0
mat = reshape([...some data...],[4,4])
dimb = nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
do i = 1, N-1
  dima = nstates**(i-1)
  dimc = nstates**(N-i-1)
  do a_row = 1, dima
    do b_row = 1, dimb
      do c_row = 1, dimc
        row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
        do b_col = 1, dimb
          col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
          !$OMP ATOMIC
          vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
        end do
      end do
    end do
  end do
end do
!$OMP END PARALLEL DO
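The program runs and the results are correct, but it is incredibly slow, much slower than without OpenMP. I know almost nothing about OpenMP. Am I using PRIVATE or OMP ATOMIC incorrectly? I would be grateful for any advice on how to improve the performance of this code.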
Answer 0 (score: 2)
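If your arrays are too large and the automatic reduction gives you stack overflows, you can implement the reduction yourself with an allocatable temporary array: each thread accumulates into its own private copy, and the copies are added to vecout one thread at a time.
As Francois Jacq pointed out, dima and dimc must also be private, otherwise they cause a race condition. dimb is assigned once before the loop and only read inside it, so it has to stay shared.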
double complex, dimension(:), allocatable :: tmp

!$OMP PARALLEL PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col,tmp)
! Each thread allocates and zeroes its own private accumulator
allocate(tmp(size(vecout)))
tmp = 0
!$OMP DO
do i = 1, N-1
  dima = nstates**(i-1)
  dimc = nstates**(N-i-1)
  do a_row = 1, dima
    do b_row = 1, dimb
      do c_row = 1, dimc
        row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
        do b_col = 1, dimb
          col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
          ! No ATOMIC needed: tmp belongs to this thread only
          tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
        end do
      end do
    end do
  end do
end do
!$OMP END DO
! Add the per-thread partial sums to the shared result, one thread at a time
!$OMP CRITICAL
vecout = vecout + tmp
!$OMP END CRITICAL
!$OMP END PARALLEL
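For comparison, the automatic reduction mentioned above can be sketched as follows. This is a minimal self-checking example, not the asker's actual setup: N is reduced to 10, the contents of vecin and mat are made-up placeholder values (the original data is elided), and the names dp, k and ref are introduced here purely for illustration. With REDUCTION(+:vecout) every thread works on its own zero-initialized copy of vecout, and the copies are summed into the shared array at the end of the loop. Depending on the compiler, those copies may live on the stack, which is why very large arrays can require the manual version.

```fortran
program reduction_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Assumption: N is kept small so the example runs quickly
  integer, parameter :: nstates = 2, N = 10
  integer :: i, k, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
  complex(dp) :: mat(4,4)
  complex(dp), allocatable :: vecin(:), vecout(:), ref(:)

  allocate(vecin(nstates**N), vecout(nstates**N), ref(nstates**N))
  ! Placeholder data (hypothetical values, chosen only to have something to sum)
  vecin = [(cmplx(k, -k, kind=dp), k = 1, nstates**N)]
  mat = reshape([(cmplx(k, 1, kind=dp), k = 1, 16)], [4,4])
  dimb = nstates**2

  ! Serial reference result
  ref = 0
  do i = 1, N-1
    dima = nstates**(i-1)
    dimc = nstates**(N-i-1)
    do a_row = 1, dima
      do b_row = 1, dimb
        do c_row = 1, dimc
          row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
          do b_col = 1, dimb
            col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
            ref(row) = ref(row) + vecin(col)*mat(b_row,b_col)
          end do
        end do
      end do
    end do
  end do

  ! Same loop nest, with OpenMP combining per-thread copies of vecout
  vecout = 0
  !$OMP PARALLEL DO REDUCTION(+:vecout) PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
  do i = 1, N-1
    dima = nstates**(i-1)
    dimc = nstates**(N-i-1)
    do a_row = 1, dima
      do b_row = 1, dimb
        do c_row = 1, dimc
          row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
          do b_col = 1, dimb
            col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
            vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
          end do
        end do
      end do
    end do
  end do
  !$OMP END PARALLEL DO

  if (maxval(abs(vecout - ref)) > 1.0e-9_dp) error stop "parallel result differs"
  print *, "reduction result matches the serial loop"
end program reduction_sketch
```

With gfortran this can be built with the -fopenmp flag; without it the directives are treated as comments and the program simply runs both loops serially.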
Answer 1 (score: 1)
You could try something like:
do b_col = 1, dimb
  do i = 1, N-1
    dima = nstates**(i-1)
    dimc = nstates**(N-i-1)
    !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(row,col,a_row,b_row,c_row)
    do a_row = 1, dima
      do b_row = 1, dimb
        do c_row = 1, dimc
          row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
          col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
          vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
        enddo
      enddo
    enddo
  enddo
enddo
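The advantage is that the parallel loop can no longer cause conflicts: for fixed b_col and i, every (a_row, b_row, c_row) combination gives a different row index, so no two threads write to the same element of vecout and no ATOMIC is needed.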
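The claim that the row indices never collide can be checked with a short serial program. The sketch below assumes a small N of 8 for speed and uses a hypothetical counter array hits that is not part of the original code. For each i it counts how often every row index is produced by the three collapsed loops and stops with an error if any index appears more or less than once.

```fortran
program check_rows_distinct
  implicit none
  ! Assumption: small N for a quick check; the index formula is the one from the answer
  integer, parameter :: nstates = 2, N = 8
  integer :: i, dima, dimb, dimc, a_row, b_row, c_row, row
  integer, allocatable :: hits(:)   ! hypothetical helper: visit count per row index

  allocate(hits(nstates**N))
  dimb = nstates**2
  do i = 1, N-1
    dima = nstates**(i-1)
    dimc = nstates**(N-i-1)
    hits = 0
    do a_row = 1, dima
      do b_row = 1, dimb
        do c_row = 1, dimc
          row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
          hits(row) = hits(row) + 1
        end do
      end do
    end do
    ! dima*dimb*dimc equals nstates**N, so each index should appear exactly once
    if (any(hits /= 1)) error stop "a row index was visited zero or several times"
  end do
  print *, "every row index is visited exactly once for each i"
end program check_rows_distinct
```

Since row does not depend on b_col, the same holds for every iteration of the outer b_col loop.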