How can I make Alea faster?

Date: 2015-08-23 20:02:07

Tags: aleagpu

After doing some work implementing various ML algorithms in Alea, I tried benchmarking some simple but fundamental routines. I was surprised to find that Alea takes roughly 3x as long as the equivalent cuBLAS sgeam call to do the same thing. I would have understood this if I were doing something more complicated than matrix multiplication, where I have to deal with shared memory, but the following is just a simple array transformation.

let dmat = createRandomUniformMatrix 100 1000 1.0f 0.0f
let dmat2 = createRandomUniformMatrix 100 1000 1.0f 0.0f
let rmat = createEmptyMatrixLike dmat

let m = new DeviceUnaryTransformModule<float32>(<@ fun x -> x*2.0f @>)

#time
//4.85s/100k
for i=1 to 100000 do
    m.Apply(dmat, rmat) |> ignore
#time

#time
//1.8s/100k
for i=1 to 100000 do
    sgeam2 nT nT 2.0f dmat 0.0f dmat2 rmat  |> ignore
#time

The kernel in DeviceUnaryTransformModule is the same as in the basic transform example; the only difference is that instead of gathering the result to the host afterwards, the data is kept on the device.
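
For reference, a module like this can be written roughly as follows. This is a simplified, float32-only sketch modeled on Alea's GPUModule API, not my exact code:

open Alea.CUDA
open Alea.CUDA.Utilities
open Microsoft.FSharp.Quotations

type UnaryTransformModule(target, op:Expr<float32 -> float32>) =
    inherit GPUModule(target)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (y:deviceptr<float32>) (x:deviceptr<float32>) (n:int) =
        // grid-stride loop: apply op element-wise, leaving the result on the device
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < n do
            y.[i] <- __eval(op) x.[i]
            i <- i + stride

    member this.Apply(x:deviceptr<float32>, y:deviceptr<float32>, n:int) =
        this.GPULaunch <@ this.Kernel @> (LaunchParam(64, 256)) x y n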

Unbound's reduce performs really poorly for me; in fact, there must be something wrong with the way I have been using it. It is roughly 20x slower than summing the matrix by calling sgeamv twice.

let makeReduce (op:Expr<'T -> 'T -> 'T>)  =
    let compileReductionKernel (op:Expr<'T -> 'T -> 'T>) =
        worker.LoadProgram(
                        DeviceReduceImpl.DeviceReduce(op, worker.Device.Arch, PlatformUtil.Instance.ProcessBitness).Template
                        )

    let prog = compileReductionKernel op

    let runReduceProgram (sumProg : Program<DeviceReduceImpl.IDeviceReduceFactory<'A>>) (x: DeviceMemory<'A>) =
        // 'blob' and 'worker' are defined elsewhere in my code
        sumProg.Entry.Create(blob, x.Length)
               .Reduce(None, x.Ptr, x.Length)

    let reduceProg (x: DeviceMemory<'T>) = runReduceProgram prog x
    reduceProg

let sumReduce: DeviceMemory<float32> -> float32 = makeReduce <@ fun (a:float32) b -> a + b @>

#time
//3.5s/10k
for i=1 to 10000 do
    sumReduce dmat.dArray |> ignore
#time

I have not yet tried comparing this with CUDA C++, but for simple things like these I would expect it to be on par with cuBLAS. I thought the optimization flags might be turned off, but then found that they are on by default. Are there any optimization tips I am missing here?

1 Answer:

Answer 0 (score: 3)

I think there are some problems in your test code:

  1. In your map module, you should pre-load the GPUModule. A GPUModule is JIT-compiled the first time it is launched, so your timing measurement actually includes the GPU code compilation time;

  2. In your map test, for both the Alea code and the cuBLAS code, you should synchronize the worker (synchronize the CUDA context). CUDA programming has an asynchronous style: when you launch a kernel, the call returns immediately without waiting for the kernel to complete. If you do not synchronize the worker, you are actually measuring the kernel launch time, not the kernel execution time. Alea GPU's launch time will be slower than native C code, because it does some marshalling of the kernel arguments. There are some further issues related to kernel launch time, which I will show in the example code below.

  3. Your reduce test actually loads the reduce module every time! That means each time you do a reduction, the measured time includes the GPU compilation time! It is recommended to keep the instance of a GPU module or program long-lived, since it represents compiled GPU code.

So, I did a test following your usage. Here, I first list the complete test code:

    #r @"packages\Alea.CUDA.2.1.2.3274\lib\net40\Alea.CUDA.dll"
    #r @"packages\Alea.CUDA.IL.2.1.2.3274\lib\net40\Alea.CUDA.IL.dll"
    #r @"packages\Alea.CUDA.Unbound.2.1.2.3274\lib\net40\Alea.CUDA.Unbound.dll"
    #r "System.Configuration"
    open System.IO
    Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(@"packages\Alea.CUDA.2.1.2.3274", "private")
    Alea.CUDA.Settings.Instance.Resource.Path <- Path.GetTempPath()
    
    open Alea.CUDA
    open Alea.CUDA.Utilities
    open Alea.CUDA.CULib
    open Alea.CUDA.Unbound
    open Microsoft.FSharp.Quotations
    
    type MapModule(target, op:Expr<float32 -> float32>) =
        inherit GPUModule(target)
    
        [<Kernel;ReflectedDefinition>]
        member this.Kernel (C:deviceptr<float32>) (A:deviceptr<float32>) (B:deviceptr<float32>) (n:int) =
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                C.[i] <- __eval(op) A.[i] + __eval(op) B.[i]
                i <- i + stride
    
        member this.Apply(C:deviceptr<float32>, A:deviceptr<float32>, B:deviceptr<float32>, n:int) =
            let lp = LaunchParam(64, 256)
            this.GPULaunch <@ this.Kernel @> lp C A B n
    
    let inline mapTemplate (op:Expr<'T -> 'T>) = cuda {
        let! kernel = 
            <@ fun (C:deviceptr<'T>) (A:deviceptr<'T>) (B:deviceptr<'T>) (n:int) ->
                let start = blockIdx.x * blockDim.x + threadIdx.x
                let stride = gridDim.x * blockDim.x
                let mutable i = start
                while i < n do
                    C.[i] <- (%op) A.[i] + (%op) B.[i]
                    i <- i + stride @>
            |> Compiler.DefineKernel
    
        return Entry(fun program ->
            let worker = program.Worker
            let kernel = program.Apply kernel
            let lp = LaunchParam(64, 256)
    
            let run C A B n =
                kernel.Launch lp C A B n
    
            run ) }
    
    let test1 (worker:Worker) m n sync iters =
        let n = m * n
        use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
        let rng = System.Random(42)
        use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use C = worker.Malloc<float32>(n)
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (no pre-load module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test2 (worker:Worker) m n sync iters =
        let n = m * n
        use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
        // we pre-load the module, this will JIT compile the GPU code
        m.GPUForceLoad()
        let rng = System.Random(42)
        use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use C = worker.Malloc<float32>(n)
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (pre-loaded module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test3 (worker:Worker) m n sync iters =
        let n = m * n
        use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
        // we pre-load the module, this will JIT compile the GPU code
        m.GPUForceLoad()
        let rng = System.Random(42)
        use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use C = worker.Malloc<float32>(n)
        // since the worker is running in a background thread
        // each cuda api will switch to that thread
        // use eval() to avoid the many thread switching
        worker.Eval <| fun _ ->
            let timer = System.Diagnostics.Stopwatch.StartNew()
            for i = 1 to iters do
                m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
            if sync then worker.Synchronize()
            timer.Stop()
            printfn "%f ms / %d %s (pre-loaded module + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test4 (worker:Worker) m n sync iters =
        use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
        let n = m * n
        let rng = System.Random(42)
        use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use C = worker.Malloc<float32>(n)
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            program.Run C.Ptr A.Ptr B.Ptr n
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (template usage)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test5 (worker:Worker) m n sync iters =
        use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
        let n = m * n
        let rng = System.Random(42)
        use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use C = worker.Malloc<float32>(n)
        worker.Eval <| fun _ ->
            let timer = System.Diagnostics.Stopwatch.StartNew()
            for i = 1 to iters do
                program.Run C.Ptr A.Ptr B.Ptr n
            if sync then worker.Synchronize()
            timer.Stop()
            printfn "%f ms / %d %s (template usage + worker.Eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test6 (worker:Worker) m n sync iters =
        use cublas = new CUBLAS(worker)
        let rng = System.Random(42)
        use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
        use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
        use dmatr = worker.Malloc<float32>(m * n)
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (cublas)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test7 (worker:Worker) m n sync iters =
        use cublas = new CUBLAS(worker)
        let rng = System.Random(42)
        use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
        use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
        use dmatr = worker.Malloc<float32>(m * n)
        worker.Eval <| fun _ ->
            let timer = System.Diagnostics.Stopwatch.StartNew()
            for i = 1 to iters do
                cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
            if sync then worker.Synchronize()
            timer.Stop()
            printfn "%f ms / %d %s (cublas + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
    
    let test worker m n sync iters =
        test6 worker m n sync iters
        test7 worker m n sync iters
        test1 worker m n sync iters
        test2 worker m n sync iters
        test3 worker m n sync iters
        test4 worker m n sync iters
        test5 worker m n sync iters
    
    let testReduce1 (worker:Worker) n iters =
        let rng = System.Random(42)
        use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
        // JIT compile and load GPU code for this module
        reduceModule.GPUForceLoad()
        // create a reducer which will allocate temp memory for maxNum=n
        let reduce = reduceModule.Create(n)
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            reduce.Reduce(input.Ptr, n) |> ignore
        timer.Stop()
        printfn "%f ms / %d (pre-load gpu code)" timer.Elapsed.TotalMilliseconds iters
    
    let testReduce2 (worker:Worker) n iters =
        let rng = System.Random(42)
        use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
        use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
        // JIT compile and load GPU code for this module
        reduceModule.GPUForceLoad()
        // create a reducer which will allocate temp memory for maxNum=n
        let reduce = reduceModule.Create(n)
        worker.Eval <| fun _ ->
            let timer = System.Diagnostics.Stopwatch.StartNew()
            for i = 1 to iters do
                reduce.Reduce(input.Ptr, n) |> ignore
            timer.Stop()
            printfn "%f ms / %d (pre-load gpu code and avoid thread switching)" timer.Elapsed.TotalMilliseconds iters
    
    let testReduce worker n iters =
        testReduce1 worker n iters
        testReduce2 worker n iters
    
    let workerDefault = Worker.Default
    let workerNoThread = Worker.CreateOnCurrentThread(Device.Default)
    

    In Alea GPU, a worker represents a CUDA context, and currently we use the pattern of one dedicated thread per GPU, with the CUDA context attached to that thread. We call this a "worker with dedicated thread". This also means that each time you call a CUDA API (such as a kernel launch), we have to switch to that worker thread. If you are doing a lot of kernel launches, it is recommended to use the Worker.Eval function to execute your code inside the worker thread, to avoid the thread switching. There is also an experimental feature for creating a worker on the current thread, which avoids the thread switching, but we are still optimizing that usage. For more details, please see here.
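
    In short, the difference looks like this (a minimal sketch taken from the test code above):

    // Without Worker.Eval: every Apply call switches to the worker thread.
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)

    // With Worker.Eval: switch to the worker thread once and launch from inside it.
    worker.Eval <| fun _ ->
        for i = 1 to iters do
            m.Apply(C.Ptr, A.Ptr, B.Ptr, n)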

    Now we first test with the default worker, without synchronizing the worker (so this means we are only comparing the kernel launch times). The default worker is a worker with a dedicated thread, so you can see it performs better when we use Worker.Eval. But overall, launching a kernel from .NET is slower than a native C kernel launch:

    > test workerDefault 10000 10000 false 100;;
    4.487300 ms / 100 nosync (cublas)
    0.560600 ms / 100 nosync (cublas + worker.eval)
    304.427900 ms / 100 nosync (no pre-load module)
    18.517000 ms / 100 nosync (pre-loaded module)
    12.579100 ms / 100 nosync (pre-loaded module + worker.eval)
    27.023800 ms / 100 nosync (template usage)
    16.007500 ms / 100 nosync (template usage + worker.Eval)
    val it : unit = ()
    > test workerDefault 10000 10000 false 100;;
    3.288600 ms / 100 nosync (cublas)
    0.647300 ms / 100 nosync (cublas + worker.eval)
    29.129100 ms / 100 nosync (no pre-load module)
    18.874700 ms / 100 nosync (pre-loaded module)
    12.285000 ms / 100 nosync (pre-loaded module + worker.eval)
    20.452300 ms / 100 nosync (template usage)
    14.903500 ms / 100 nosync (template usage + worker.Eval)
    val it : unit = ()
    

    Also, you may notice that I ran this test twice, and the first time the test without the pre-loaded module took 304 ms, while the second time it took only 29 ms. The reason is that we use LLVM P/Invoke to compile the kernel, and those P/Invoke functions are lazy, so there is some initialization the first time you use them, after which it becomes faster.
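
    A simple way to keep that one-time initialization out of a measurement is to do a warm-up launch and synchronize before starting the timer (a minimal sketch of the idea, reusing the names from the tests above):

    // Force JIT compilation and trigger the lazy P/Invoke initialization.
    m.GPUForceLoad()
    m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    worker.Synchronize()

    // Only now start measuring.
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    worker.Synchronize()
    timer.Stop()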

    Now we synchronize the worker, which measures the real kernel execution time, and now the numbers are similar. The kernel I created here is very simple, but it operates on both matrices A and B:

    > test workerDefault 10000 10000 true 100;;
    843.695000 ms / 100 sync (cublas)
    841.452400 ms / 100 sync (cublas + worker.eval)
    919.244900 ms / 100 sync (no pre-load module)
    912.348000 ms / 100 sync (pre-loaded module)
    908.909000 ms / 100 sync (pre-loaded module + worker.eval)
    914.834100 ms / 100 sync (template usage)
    914.170100 ms / 100 sync (template usage + worker.Eval)
    

    Now, if we test them on the thread-less worker (the one created on the current thread), they are a bit faster, since there is no thread switching:

    > test workerNoThread 10000 10000 true 100;;
    842.132100 ms / 100 sync (cublas)
    841.627200 ms / 100 sync (cublas + worker.eval)
    918.007800 ms / 100 sync (no pre-load module)
    908.575900 ms / 100 sync (pre-loaded module)
    908.770100 ms / 100 sync (pre-loaded module + worker.eval)
    913.405300 ms / 100 sync (template usage)
    913.942600 ms / 100 sync (template usage + worker.Eval)
    

    And here are the tests for reduce:

    > testReduce workerDefault 10000000 100;;
    7691.335300 ms / 100 (pre-load gpu code)
    6448.782500 ms / 100 (pre-load gpu code and avoid thread switching)
    val it : unit = ()
    > testReduce workerNoThread 10000000 100;;
    6467.105300 ms / 100 (pre-load gpu code)
    6426.296900 ms / 100 (pre-load gpu code and avoid thread switching)
    val it : unit = ()
    

    Please note that in this reduce test, each reduction involves a memory copy (memcpyDtoH) to get the result from the device back to the host. That memory-copy API call automatically synchronizes the worker, because the value would be meaningless if the kernel had not finished yet. So if you want to compare the performance with C code, you should also copy the result scalar from device to host there. Although it is just one CUDA API call, since you do it over many iterations (100 in this example), it accumulates some time.
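
    In other words, each iteration of the reduce loop above effectively does the following (sketch):

    // Reduce launches the reduction kernel and then copies the float32 result
    // back to the host (memcpyDtoH), which synchronizes the worker before
    // the call returns.
    let total : float32 = reduce.Reduce(input.Ptr, n)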

    Hope this answers your question.