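The following F# code crashes on the third call with an out-of-memory exception. Either I'm missing something, or Alea is not freeing memory correctly for some reason. I have tried it both in F# Interactive and compiled. I also tried calling Dispose manually, but that did not help. Any idea why?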
let squareGPU (inputs:float[]) =
    // Device buffers; `use` disposes them when the function returns.
    use dInputs = worker.Malloc(inputs)
    use dOutputs = worker.Malloc(inputs.Length)
    let blockSize = 256
    let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
    let gridSize = Math.Min(16 * numSm, divup inputs.Length blockSize)
    let lp = new LaunchParam(gridSize, blockSize)
    worker.Launch <@ squareKernel @> lp dOutputs.Ptr dInputs.Ptr inputs.Length
    // Copies the result back into a newly allocated host array.
    dOutputs.Gather()
let x = squareGPU [|0.0..0.001..100000.0|]
printfn "1"
let y = squareGPU [|0.0..0.001..100000.0|]
printfn "2"
let z = squareGPU [|0.0..0.001..100000.0|]
printfn "3"
Answer 0 (score: 2)
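I guess you got a System.OutOfMemoryException, right? That does not mean the GPU device ran out of memory; it means your host memory ran out. In your example you create a rather large array on the host, process it, and gather another large array as the output. The key point is that you store the output arrays in different variables (x, y and z), all of which stay reachable, so the GC never gets a chance to free them and you eventually run out of host memory.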
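To put rough numbers on this, here is a back-of-the-envelope sketch. The helper names `approxElements` and `approxMiB` are made up for illustration, and the element count is an estimate, since a floating-point range can differ by one element. Each array with a stop value of 100000 comes to roughly 763 MiB, and each call to squareGPU keeps both an input and an output array of that size alive for a while. The peak is higher still, because the stack trace in this answer shows the range being built through a growing List before it is copied into an array.

```fsharp
// Rough size estimate for arrays such as [|0.0..0.001..100000.0|].
// approxElements and approxMiB are hypothetical helpers, not part of any library.
let approxElements (stop: float) (step: float) : int64 =
    int64 (round (stop / step)) + 1L

let approxMiB (stop: float) (step: float) : float =
    let bytes = approxElements stop step * int64 sizeof<float>   // 8 bytes per float
    float bytes / (1024.0 * 1024.0)

printfn "stop 100000: about %.0f MiB per array" (approxMiB 100000.0 0.001)
printfn "stop  30000: about %.0f MiB per array" (approxMiB 30000.0 0.001)
```

A 32-bit process has only a few gigabytes of address space at best, so a handful of such arrays is enough to exhaust it.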
I ran a very simple test (with a stop value of 30000 instead of the 100000 in your example) that uses only host code, no GPU code:
let x1 = [|0.0..0.001..30000.0|]
printfn "1"
let x2 = [|0.0..0.001..30000.0|]
printfn "2"
let x3 = [|0.0..0.001..30000.0|]
printfn "3"
let x4 = [|0.0..0.001..30000.0|]
printfn "4"
let x5 = [|0.0..0.001..30000.0|]
printfn "5"
let x6 = [|0.0..0.001..30000.0|]
printfn "6"
I ran this code in F# Interactive (which is a 32-bit process) and got this:
Microsoft (R) F# Interactive version 12.0.30815.0
Copyright (c) Microsoft Corporation. All Rights Reserved.
For help type #help;;
>
1
2
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at System.Collections.Generic.List`1.set_Capacity(Int32 value)
at System.Collections.Generic.List`1.EnsureCapacity(Int32 min)
at System.Collections.Generic.List`1.Add(T item)
at Microsoft.FSharp.Collections.SeqModule.ToArray[T](IEnumerable`1 source)
at <StartupCode$FSI_0002>.$FSI_0002.main@() in C:\Users\Xiang\Documents\Inbox\ConsoleApplication6\Script1.fsx:line 32
Stopped due to error
>
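This means that after creating two arrays of this size (x1 and x2), I ran out of host memory.
To confirm this further, I reused the same variable name inside a function, which gives the GC a chance to collect the old arrays. This time it works: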
let foo() =
    let x = [|0.0..0.001..30000.0|]
    printfn "1"
    let x = [|0.0..0.001..30000.0|]
    printfn "2"
    let x = [|0.0..0.001..30000.0|]
    printfn "3"
    let x = [|0.0..0.001..30000.0|]
    printfn "4"
    let x = [|0.0..0.001..30000.0|]
    printfn "5"
    let x = [|0.0..0.001..30000.0|]
    printfn "6"
>
val foo : unit -> unit
> foo()
;;
1
2
3
4
5
6
val it : unit = ()
>
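One way to observe that a large array only becomes collectable once nothing references it is the small host-only sketch below. It is an illustration under assumptions, not a benchmark: `allocateBig` and `measure` are hypothetical names, the array size is arbitrary, and the exact numbers vary by runtime and build configuration.

```fsharp
open System

// Hypothetical helper: allocates roughly 80 MB of host memory.
let allocateBig () : float[] = Array.zeroCreate 10000000

let measure () =
    let mutable held = allocateBig ()
    let whileHeld = GC.GetTotalMemory(false)
    // Keep the array alive up to this point, then drop the only reference to it.
    GC.KeepAlive(held)
    held <- null
    // Passing true asks the runtime to collect before reporting the total.
    let afterDrop = GC.GetTotalMemory(true)
    whileHeld, afterDrop

let whileHeld, afterDrop = measure ()
printfn "while held: %d MB, after dropping the reference: %d MB"
    (whileHeld / 1000000L) (afterDrop / 1000000L)
```

As long as a binding such as x, y or z still refers to the array, the second figure cannot drop in this way.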
If I add the GPU kernel, it still works:
let foo() =
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "1"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "2"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "3"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "4"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "5"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "6"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "7"
    let x = squareGPU [|0.0..0.001..30000.0|]
    printfn "8"
> foo();;
1
2
3
4
5
6
7
8
val it : unit = ()
>
Alternatively, you can try running as a 64-bit process.
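If you are unsure which kind of process you are actually running in, a quick check along the following lines can help. This is a minimal sketch that uses only standard .NET properties and nothing specific to Alea:

```fsharp
open System

// Reports whether the current process (FSI or a compiled app) is 64-bit.
printfn "64-bit process: %b" Environment.Is64BitProcess
printfn "64-bit OS:      %b" Environment.Is64BitOperatingSystem
// IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit one.
printfn "Pointer size:   %d bytes" IntPtr.Size
```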
Answer 1 (score: 0)
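The GC works on a separate background thread, so if you allocate new huge arrays very frequently, it can easily throw that out-of-memory exception.
For arrays this large, I suggest using an "in-place update" style, which is more stable. I wrote a test to show it. (Note: because the array is very large, you should go to the project properties page and, on the Build tab, uncheck "Prefer 32-bit" to make sure it runs as a 64-bit process.)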
open System
open Alea.CUDA
open Alea.CUDA.Utilities
open NUnit.Framework
[<ReflectedDefinition>]
let squareKernel (outputs:deviceptr<float>) (inputs:deviceptr<float>) (n:int) =
    let start = blockIdx.x * blockDim.x + threadIdx.x
    let stride = gridDim.x * blockDim.x
    let mutable i = start
    while i < n do
        outputs.[i] <- inputs.[i] * inputs.[i]
        i <- i + stride
let squareGPUInplaceUpdate (worker:Worker) (lp:LaunchParam) (hData:float[]) (dData:DeviceMemory<float>) =
    // Instead of allocating new device memory, reuse the existing device memory dData
    // and scatter the new data into it.
    dData.Scatter(hData)
    worker.Launch <@ squareKernel @> lp dData.Ptr dData.Ptr hData.Length
    // Ideally there would be a counterpart to dData.Scatter(hData), such as dData.Gather(hData),
    // but that is missing; worker.Gather serves as a workaround.
    worker.Gather(dData.Ptr, hData)
let squareGPUManyTimes (iters:int) =
    let worker = Worker.Default
    // Across all iterations, only two host arrays (data and expected values) and one
    // device array are allocated. They are reused because they are large.
    // Allocating a huge array very frequently puts the GC under pressure, and since the GC
    // works on a separate thread, you would get System.OutOfMemoryException from time to time.
    let hData = [|0.0..0.001..100000.0|]
    let n = hData.Length
    let expected = Array.zeroCreate n
    use dData = worker.Malloc<float>(n)
    let rng = Random()
    let update () =
        // Update the data in place.
        for i = 0 to n - 1 do
            hData.[i] <- rng.NextDouble()
            expected.[i] <- hData.[i] * hData.[i]
    let lp =
        let blockSize = 256
        let numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT
        let gridSize = Math.Min(16 * numSm, divup n blockSize)
        new LaunchParam(gridSize, blockSize)
    for i = 1 to iters do
        update()
        squareGPUInplaceUpdate worker lp hData dData
        Assert.AreEqual(expected, hData)
        printfn "iter %d passed..." i
[<Test>]
let test() =
    squareGPUManyTimes 5
Note that System.OutOfMemoryException always refers to host memory. If the GPU runs out of memory, a CUDAException is thrown instead.
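By the way, every call to DeviceMemory.Gather() allocates a new .NET array and fills it. With the in-place approach shown in this example, you supply the .NET array yourself and let it be filled with the data from the device.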
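The same trade-off shows up in plain host code, without any GPU involved. The sketch below is only an analogy: `squareNew` and `squareInPlace` are hypothetical names and not part of Alea. The first version returns a freshly allocated array on every call, much like Gather() does, while the second writes into the array the caller already owns.

```fsharp
// Allocates and returns a new array on every call.
let squareNew (xs: float[]) : float[] =
    Array.map (fun x -> x * x) xs

// Overwrites the caller's array; no new array is allocated.
let squareInPlace (xs: float[]) : unit =
    for i in 0 .. xs.Length - 1 do
        xs.[i] <- xs.[i] * xs.[i]

let data = [| 1.0; 2.0; 3.0 |]
let fresh = squareNew data   // data is unchanged; fresh is a second array
squareInPlace data           // data itself now holds the squares
printfn "%A %A" fresh data
```

With arrays of several hundred megabytes, the in-place version keeps the number of live large arrays constant no matter how many iterations run.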