Question

Here is the code I run on the GPU:

tid=threadidx%x
bid=blockidx%x
bdim=blockdim%x

isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
    if(.not.  sec_is_int(isec))  return

    do iele = tid, sec_n_ele(isec), bdim

        idx =  n_ele_idx(isec)+iele

        u(1:5)  =   fv_u(1:5,idx)
        u(6  )  =   fv_t(    idx)
        g       =  0.0d0
        do j=  sec_iA_ls(idx), sec_iA_ls(idx+1)-1
            ss  =   sec_jA_ls(1,j)
            ee  =   sec_jA_ls(2,j)
            tem =   n_ele_idx(ss)+ee
            du(1:5) =  fv_u(1:5, n_ele_idx(ss)+ee)-u(1:5)
            du(6  ) =  fv_t(     n_ele_idx(ss)+ee)-u(6  )
            coe(1:3) = sec_coe_ls(1:3,j)
            do k=1,6
                g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
            end do
        end do
        do j=1,6
        do i=1,3
            fv_gra(i+(j-1)*3,idx)=g(i,j)
        end do
        end do
    end do
end if

And here is the code I run on the CPU:

    do isec = h_mesh_sec_1(lev), h_mesh_sec_0(lev)
        if (.not. h_sec_is_int(isec)) cycle
        do iele = 1, h_sec_n_ele(isec)

            idx = h_n_ele_idx(isec)+iele

            u(1:5)  =  h_fv_u(1:5,idx)
            u(6  )  =  h_fv_t(    idx)
            g       =  0.0d0
            do j = h_sec_iA_ls(idx), h_sec_iA_ls(idx+1)-1
                ss  =  h_sec_jA_ls(1,j)
                ee  =  h_sec_jA_ls(2,j)
                du(1:5) =  h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
                du(6  ) =  h_fv_t(    h_n_ele_idx(ss)+ee)-u(6  )
                do k = 1, 6
                    g(1:3,k) = g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
                end do
            end do
            do j = 1, 6
                do i = 1, 3
                    h_fv_gra(i+(j-1)*3,idx) = g(i,j)
                end do
            end do

        end do
    end do

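Variables with the h_ prefix belong to the CPU, and those without it belong to the GPU. The results agree at most points, but at some points they differ slightly. I added the following check code: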

    do i = 1, size(h_fv_gra,1)
        do j = 1, size(h_fv_gra,2)
            if (hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
                print *, hd_fv_gra(i,j)-h_fv_gra(i,j), i, j
            end if
        end do
    end do

hd_* is a host-side copy of the GPU result. The check prints these differences (difference, i, j):

    1.8626451492309570E-009 13 14306
    1.8626451492309570E-009 13 14465
    1.8626451492309570E-009 13 14472
    1.8626451492309570E-009 14 14128
    1.8626451492309570E-009 14 14146
    1.8626451492309570E-009 14 14150
    1.8626451492309570E-009 14 14153
    1.8626451492309570E-009 14 14155
    1.8626451492309570E-009 14 14156

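So I am confused by this. The precision error in CUDA should not be this large. Any reply is welcome. Also, I don't know how to print variables in GPU code, which would help me debug.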

Answer 1

In your code, the computation of g is most likely being compiled by CUDA into Fused Multiply-Add (FMA) instructions:

g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)

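On the CPU side, FMA is not impossible, but it depends heavily on compiler choices (and on whether the actual CPU running the code implements FMA). An FMA rounds once, whereas a separate multiply followed by an add rounds twice, so the two paths can differ in the last bit of the result.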

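To force separate multiply and add operations, use the CUDA intrinsics defined in the CUDA Math API documentation, which are never merged into a single multiply-add instruction, for example:

__device__ double __dadd_rn ( double x, double y ) Adds two floating-point values in round-to-nearest-even mode.

and

__device__ double __dmul_rn ( double x, double y ) Multiplies two floating-point values in round-to-nearest-even mode.

The rounding mode should be identical to the one used on the CPU (this depends on the CPU architecture, whether Power, Intel x86, or another).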

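An alternative is to pass the --fmad false option to ptxas when compiling the CUDA code, using nvcc's --ptxas-options option as described in the nvcc documentation. nvcc also accepts --fmad=false directly.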
