How to use a CUDA DevicePtr as an accelerate Array

Date: 2018-01-30 17:10:14

Tags: haskell cuda gpu ffi accelerate-haskell

I'm trying to use a DevicePtr (known as a CUdeviceptr in CUDA-land) produced by foreign code as an accelerate Array, using the cuda package together with accelerate-llvm-ptx.

The code I've written below sort of works:

import Data.Array.Accelerate
       (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
       (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
       (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
       (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Data.Monoid ((<>))
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)

-- A foreign function that uses cuMemAlloc() and cuMemcpyHtoD() to
-- create data on the GPU.  The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function.  It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong

-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@.  This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral

-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'.  It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show z

When this program is compiled and run, it correctly prints out the result:

zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]

However, from reading the accelerate and accelerate-llvm-ptx source code, it doesn't seem like this should work.

For the most part, an accelerate Array seems to carry a pointer to the array data in HOST memory, together with a Unique value that uniquely identifies the Array. When an Acc computation is executed, accelerate loads the array data from HOST memory into GPU memory as needed, and keeps track of it with a HashMap indexed by that Unique.
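From what I can tell, the piece of the source that makes this possible is accelerate's UniqueArray type, which pairs the (normally HOST-side) storage with that Unique identifier. Roughly, paraphrasing Data.Array.Accelerate.Array.Unique from accelerate-1.x (field names may differ slightly between versions):

-- Paraphrased sketch of accelerate's internal UniqueArray type, not my code.
-- The ForeignPtr is expected to point at HOST memory; the Unique is what the
-- memory tables use to track the corresponding device-side copy.
data UniqueArray e = UniqueArray
  { uniqueArrayId   :: !Unique                    -- identifies this array
  , uniqueArrayData :: !(Lifetime (ForeignPtr e)) -- the array payload
  }

In my code above I'm smuggling a GPU pointer in through that ForeignPtr field, which is presumably why accelerate treats it like ordinary HOST data.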

In the code above, I create an Array directly with a pointer to GPU data. It doesn't seem like this should work, yet it does appear to work in the code above.

However, some things don't work. For example, trying to print out xs (my Array holding a pointer to GPU data) fails with a segfault. This makes sense, because the Show instance for Array just tries to peek the array data through the HOST pointer. That fails here because it is not a HOST pointer, but a GPU pointer:

-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs

Is there a correct way to take a CUDA DevicePtr and use it directly as an accelerate Array?

1 Answer:

Answer 0 (score: 1):

Actually, I'm surprised that the above worked at all; I wasn't able to replicate it.

One problem here is that device memory is implicitly associated with an execution context; a pointer in one context is not valid in a different context, even on the same GPU (unless you explicitly enable peer memory access between those contexts).
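As a minimal illustration of the context issue (my own sketch, not part of the fix below; it only assumes the context-stack operations of the cuda package):

    import Foreign.CUDA.Driver as CUDA

    -- Allocate in one context, then make a second context current: the
    -- pointer is unusable until its owning context is current again.
    contextDemo :: IO ()
    contextDemo = do
      CUDA.initialise []
      dev  <- CUDA.device 0
      ctxA <- CUDA.create dev []    -- ctxA becomes the current context
      ptr  <- CUDA.mallocArray 32 :: IO (CUDA.DevicePtr Float)
      ctxB <- CUDA.create dev []    -- ctxB is now current; `ptr` is not valid here
      -- Any copy or kernel launch using `ptr` under ctxB would fail, even
      -- though both contexts live on the same GPU.
      _ <- CUDA.pop                 -- pop ctxB off the stack; ctxA is current again
      CUDA.free ptr                 -- safe: we are back in the owning context
      CUDA.destroy ctxB
      CUDA.destroy ctxA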

So this question really has two parts:

  1. getting the foreign data into Accelerate in a way it understands; and
  2. making sure that subsequent Accelerate computations are executed in a context from which this memory is accessible.

Solution

    Here is the C code we use to generate the data on the GPU:

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    CUdeviceptr generate_gpu_data()
    {
      CUresult    status = CUDA_SUCCESS;
      CUdeviceptr d_arr;
    
      const int N = 32;
      float h_arr[N];
    
      for (int i = 0; i < N; ++i) {
        h_arr[i] = (float)i;
      }
    
      status = cuMemAlloc(&d_arr, N*sizeof(float));
      if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemAlloc failed (%d)\n", status);
        exit(1);
      }
    
      status = cuMemcpyHtoD(d_arr, (void*) h_arr, N*sizeof(float));
      if (CUDA_SUCCESS != status) {
        fprintf(stderr, "cuMemcpyHtoD failed (%d)\n", status);
        exit(1);
      }
    
      return d_arr;
    }
    

    And the Haskell/Accelerate code which uses it:

    {-# LANGUAGE ForeignFunctionInterface #-}
    
    import Data.Array.Accelerate                                        as A
    import Data.Array.Accelerate.Array.Sugar                            as Sugar
    import Data.Array.Accelerate.Array.Data                             as AD
    import Data.Array.Accelerate.Array.Remote.LRU                       as LRU
    
    import Data.Array.Accelerate.LLVM.PTX                               as PTX
    import Data.Array.Accelerate.LLVM.PTX.Foreign                       as PTX
    
    import Foreign.CUDA.Driver                                          as CUDA
    
    import Text.Printf
    
    main :: IO ()
    main = do
      -- Initialise CUDA and create an execution context. From this we also create
      -- the context that our Accelerate programs will run in.
      --
      CUDA.initialise []
      dev <- CUDA.device 0
      ctx <- CUDA.create dev []
      ptx <- PTX.createTargetFromContext ctx
    
      -- When created, a context becomes the active context, so when we call the
      -- foreign function this is the context that it will be executed within.
      --
      fp  <- c_generate_gpu_data
    
      -- To import this data into Accelerate, we need both the host-side array
      -- (typically the only thing we see) and then associate this with the existing
      -- device memory (rather than allocating new device memory automatically).
      --
      -- Note that you are still responsible for freeing the device-side data when
      -- you no longer need it.
      --
      arr@(Array _ ad) <- Sugar.allocateArray (Z :. 32) :: IO (Vector Float)
      LRU.insertUnmanaged (ptxMemoryTable ptx) ad fp
    
      -- NOTE: there seems to be a bug where we haven't recorded that the host-side
      -- data is dirty, and thus needs to be filled in with values from the GPU _if_
      -- those are required on the host. At this point we have the information
      -- necessary to do the transfer ourselves, but I guess this should really be
      -- fixed...
      --
      -- CUDA.peekArray 32 fp (AD.ptrsOfArrayData ad)
    
      -- An alternative workaround to the above is this no-op computation (this
      -- consumes no additional host or device memory, and executes no kernels).
      -- If you never need the values on the host, you could ignore this step.
      --
      let arr' = PTX.runWith ptx (use arr)
    
      -- We can now use the array as in a regular Accelerate computation. The only
      -- restriction is that we need to `run*With`, so that we are running in the
      -- context of the foreign memory.
      --
      let r = PTX.runWith ptx $ A.fold (+) 0 (use arr')
    
      printf "array is: %s\n" (show arr')
      printf "sum is:   %s\n" (show r)
    
      -- Free the foreign memory (again, it is not managed by Accelerate)
      --
      CUDA.free fp
    
    
    foreign import ccall unsafe "generate_gpu_data"
      c_generate_gpu_data :: IO (DevicePtr Float)