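I seem to be missing something when trying to use NVBLAS with the Intel Fortran compiler.
I appear to be linking against NVBLAS and picking up nvblas.conf correctly, because I see feedback from the NVBLAS initialisation at run time. However, NVBLAS does not seem to intercept the calls to DGEMM, as only the CPU implementation is executed. This is despite using: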
NVBLAS_CPU_RATIO_CGEMM 0.0
in nvblas.conf (or removing it entirely).
If I disable access to the CPU BLAS implementation by removing:
NVBLAS_CPU_BLAS_LIB /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so
the program crashes at run time, as I would expect.
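For illustration, here is a minimal sketch of an nvblas.conf aimed at this situation, written from the shell. The keywords are the ones documented for NVBLAS. The log file name `nvblas.log` and the `NVBLAS_GPU_LIST ALL` setting are assumptions, not settings taken from the post. The CPU ratio keyword is per routine, so the sketch uses the DGEMM variant; a CGEMM ratio would not be expected to influence DGEMM calls. With trace logging enabled, the log should list each BLAS call that NVBLAS intercepts, which helps show whether DGEMM reaches NVBLAS at all.

```shell
#!/bin/sh
# Sketch: write an nvblas.conf that logs every intercepted BLAS call.
# The CPU BLAS path is the one quoted above; the log file name is an assumption.
cat > nvblas.conf <<'EOF'
# CPU BLAS library that NVBLAS falls back to (mandatory)
NVBLAS_CPU_BLAS_LIB /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so
# Use every visible GPU (assumed setting)
NVBLAS_GPU_LIST ALL
# Keep all DGEMM work on the GPU; the ratio keyword is per routine
NVBLAS_CPU_RATIO_DGEMM 0.0
# Write a log file and record each intercepted BLAS call in it
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED
EOF

# NVBLAS reads nvblas.conf from the current directory by default;
# NVBLAS_CONFIG_FILE makes the location explicit.
NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
export NVBLAS_CONFIG_FILE
echo "wrote $NVBLAS_CONFIG_FILE"
```

After a run, an empty or missing trace for DGEMM in `nvblas.log` would suggest the call never reached NVBLAS, as opposed to being routed to the CPU by NVBLAS itself.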
The compiler options I am currently using are shown below. I have also tried linking MKL manually, with the same result.
# Compiler options
FFLAGS=-O3 -axAVX,SSE4.2 -msse3 -align array32byte -fpe1 -fno-alias -openmp -mkl=parallel -heap-arrays 32
# Linker options
LDFLAGS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
# List of libraries used
LIBS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
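Seeing the NVBLAS initialisation message only shows that libnvblas was loaded. It does not show which library the program's `dgemm_` symbol was bound to. The NVBLAS documentation asks for `-lnvblas` to appear before the CPU BLAS on the link line (or for libnvblas.so to be injected with `LD_PRELOAD`). Whether the MKL libraries pulled in by `-mkl=parallel` end up ahead of it in this build is not something the post establishes. The sketch below assumes Linux with the glibc loader and reports where each `dgemm_` reference binds; `which_dgemm` and `./my_program` are hypothetical names.

```shell
#!/bin/sh
# Sketch: report which shared library each dgemm_ reference binds to.
# Relies on the glibc loader's LD_DEBUG=bindings diagnostic (Linux/glibc only).
which_dgemm() {
    # "$@" is the program and its arguments; loader diagnostics go to stderr,
    # so stderr is merged into stdout before filtering.
    LD_DEBUG=bindings "$@" 2>&1 | grep "dgemm_"
}

# Hypothetical usage; ./my_program stands in for the real executable:
#   which_dgemm ./my_program
# The same check with NVBLAS forced to the front of the symbol search order:
#   LD_PRELOAD=/path/to/libnvblas.so which_dgemm ./my_program
```

If the reported binding points at an MKL library and not at libnvblas.so, the call is bypassing NVBLAS before any nvblas.conf setting can apply.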
An example call to DGEMM looks like this:
call dgemm('N','T',nCols2,nCols1,nOcc(s),2.0d0/dble(nSpins),C2,nRowsP,C(:,:,s),nRowsP,0.0d0,P(i21,i11,s),nOrbsP)
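Unfortunately I am currently limited to the Intel compiler, but that restriction will be lifted soon (at which point I will use CUDA Fortran to optimise data movement).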
Answer 0 (score: 1)
I am not sure what is going on here. If I take a very simple DGEMM example (cribbed directly from the MKL Fortran guide):
      PROGRAM MAIN
      IMPLICIT NONE
      DOUBLE PRECISION ALPHA, BETA
      INTEGER M, K, N, I, J
      PARAMETER (M=8000, K=8000, N=8000)
      DOUBLE PRECISION A(M,K), B(K,N), C(M,N)
      PRINT *, "Initializing data for matrix multiplication C=A*B for "
      PRINT 10, " matrix A(",M," x",K, ") and matrix B(", K," x", N, ")"
 10   FORMAT(a,I5,a,I5,a,I5,a,I5,a)
      PRINT *, ""
      ALPHA = 1.0
      BETA = 0.0
      PRINT *, "Intializing matrix data"
      PRINT *, ""
      DO I = 1, M
        DO J = 1, K
          A(I,J) = (I-1) * K + J
        END DO
      END DO
      DO I = 1, K
        DO J = 1, N
          B(I,J) = -((I-1) * N + J)
        END DO
      END DO
      DO I = 1, M
        DO J = 1, N
          C(I,J) = 0.0
        END DO
      END DO
      PRINT *, "Computing matrix product using DGEMM subroutine"
      CALL DGEMM('N','N',M,N,K,ALPHA,A,M,B,K,BETA,C,M)
      PRINT *, "Computations completed."
      PRINT *, ""
      PRINT *, "Top left corner of matrix A:"
      PRINT 20, ((A(I,J), J = 1,MIN(K,6)), I = 1,MIN(M,6))
      PRINT *, ""
      PRINT *, "Top left corner of matrix B:"
      PRINT 20, ((B(I,J),J = 1,MIN(N,6)), I = 1,MIN(K,6))
      PRINT *, ""
 20   FORMAT(6(F12.0,1x))
      PRINT *, "Top left corner of matrix C:"
      PRINT 30, ((C(I,J), J = 1,MIN(N,6)), I = 1,MIN(M,6))
      PRINT *, ""
 30   FORMAT(6(ES12.4,1x))
      PRINT *, "Example completed."
      STOP
      END
If I build the code with the Intel compiler (12.1) and run it under nvprof (note that I do not have access to MKL at the moment, so I am using OpenBLAS built with ifort):
$ ifort -o nvblas_test nvblas_test.f -L/opt/cuda-7.5/lib64 -lnvblas
$ echo -e "NVBLAS_CPU_BLAS_LIB /opt/openblas/lib/libopenblas.so\nNVBLAS_AUTOPIN_MEM_ENABLED\n" > nvblas.conf
$ nvprof --print-gpu-summary ./nvblas_test
==23978== NVPROF is profiling process 23978, command: ./nvblas_test
[NVBLAS] Config parsed
Initializing data for matrix multiplication C=A*B for
matrix A( 8000 x 8000) and matrix B( 8000 x 8000)
Intializing matrix data
Computing matrix product using DGEMM subroutine
Computations completed.
Top left corner of matrix A:
1. 2. 3. 4. 5. 6.
8001. 8002. 8003. 8004. 8005. 8006.
16001. 16002. 16003. 16004. 16005. 16006.
24001. 24002. 24003. 24004. 24005. 24006.
32001. 32002. 32003. 32004. 32005. 32006.
40001. 40002. 40003. 40004. 40005. 40006.
Top left corner of matrix B:
-1. -2. -3. -4. -5. -6.
-8001. -8002. -8003. -8004. -8005. -8006.
-16001. -16002. -16003. -16004. -16005. -16006.
-24001. -24002. -24003. -24004. -24005. -24006.
-32001. -32002. -32003. -32004. -32005. -32006.
-40001. -40002. -40003. -40004. -40005. -40006.
Top left corner of matrix C:
-1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15
-3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15
-5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15
-7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15
-9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15
-1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16
Example completed.
==23978== Profiling application: ./nvblas_test
==23978== Profiling result:
Time(%) Time Calls Avg Min Max Name
92.15% 8.56855s 512 16.736ms 9.6488ms 21.520ms void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
7.38% 685.77ms 1025 669.04us 896ns 820.55us [CUDA memcpy HtoD]
0.47% 44.017ms 64 687.77us 504.56us 763.05us [CUDA memcpy DtoH]
I get what I would expect: the DGEMM call is offloaded to the GPU. When I do this instead:
$ echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf
$ nvprof --print-gpu-summary ./nvblas_test
==23991== NVPROF is profiling process 23991, command: ./nvblas_test
[NVBLAS] Config parsed
Initializing data for matrix multiplication C=A*B for
matrix A( 8000 x 8000) and matrix B( 8000 x 8000)
Intializing matrix data
Computing matrix product using DGEMM subroutine
Computations completed.
Top left corner of matrix A:
1. 2. 3. 4. 5. 6.
8001. 8002. 8003. 8004. 8005. 8006.
16001. 16002. 16003. 16004. 16005. 16006.
24001. 24002. 24003. 24004. 24005. 24006.
32001. 32002. 32003. 32004. 32005. 32006.
40001. 40002. 40003. 40004. 40005. 40006.
Top left corner of matrix B:
-1. -2. -3. -4. -5. -6.
-8001. -8002. -8003. -8004. -8005. -8006.
-16001. -16002. -16003. -16004. -16005. -16006.
-24001. -24002. -24003. -24004. -24005. -24006.
-32001. -32002. -32003. -32004. -32005. -32006.
-40001. -40002. -40003. -40004. -40005. -40006.
Top left corner of matrix C:
-1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15
-3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15
-5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15
-7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15
-9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15
-1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16
Example completed.
==23991== Profiling application: ./nvblas_test
==23991== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 768ns 1 768ns 768ns 768ns [CUDA memcpy HtoD]
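I get no GPU offload. If you cannot reproduce this, then the problem lies with your compiler version (you have not said which one you are using). If you can, then perhaps some of the fancier build options you are using are interacting with NVBLAS in an unexpected way.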
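The comparison between the two runs above can be scripted. This is a rough sketch, assuming nvprof's summary layout as shown in the transcripts and that the GPU kernel name contains `gemm`. Kernel names vary between cuBLAS versions, so treat it as a heuristic. `dgemm_on_gpu` is a hypothetical helper name.

```shell
#!/bin/sh
# Heuristic: succeed if an nvprof GPU summary (read from stdin) lists a GEMM kernel.
# Only rows after the "Profiling result:" header are examined, so the program's own
# "Computing matrix product using DGEMM subroutine" line is not mistaken for a kernel.
dgemm_on_gpu() {
    awk '/Profiling result:/ { seen = 1; next }
         seen && tolower($0) ~ /gemm/ { found = 1 }
         END { exit !found }'
}

# Run the check only where nvprof and the test binary are available.
if command -v nvprof >/dev/null 2>&1 && [ -x ./nvblas_test ]; then
    # nvprof writes its report to stderr, hence the 2>&1
    if nvprof --print-gpu-summary ./nvblas_test 2>&1 | dgemm_on_gpu; then
        echo "DGEMM was offloaded to the GPU"
    else
        echo "no GEMM kernel in the GPU summary"
    fi
fi
```

Running the same check on the real application, once with the original build flags and once with a reduced set, is one way to narrow down which option (if any) changes the outcome.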