How can I get all cores to participate in parallelization with Async?

Date: 2017-06-01 00:44:54

Tags: asynchronous parallel-processing f#

The following functions process a list in parallel by first breaking the list into chunks and then processing each chunk.

// Split xs into chunks of size chunkSize by grouping the indices 0..N-1
// by idx / chunkSize and looking each index back up in the list.
let chunkList chunkSize (xs : list<'T>) = 
    query {
        for idx in 0..(xs.Length - 1) do
        groupBy (idx / chunkSize) into g
        select (g |> Seq.map (fun idx -> xs.[idx]))
    }

// Wrap foo x in an async for every element, run them all in parallel and wait.
let par (foo: 'T -> 'S) (xs: list<'T>) = 
    xs
    |> List.map (fun x -> async { return foo x })
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Array.toList

// Chunk the list, map f over each chunk in parallel, then flatten the results.
let parChunks chunkSize (f: 'T -> 'S) (xs: list<'T>) =
    chunkList chunkSize xs |> Seq.map List.ofSeq |> List.ofSeq
    |> par (List.map f)
    |> List.concat

This function is used to test parChunks:

let g i = [1..1000000] |> List.map (fun x -> sqrt (float (1000 * x + 1))) |> List.head

Running the standard List.map and parChunks with a chunk size equal to half the list size gives a performance gain:


List.map g [1..100] ;;   // Real:00:00:28.979,CPU:00:00:29.562


parChunks 50 g [1..100] ;;   // Real:00:00:23.027,CPU:00:00:24.687

However, if the chunk size equals one quarter of the list size, the performance is nearly the same. I did not expect this, since my processor (Intel 6700HQ) has four cores.


parChunks 25 g [1..100] ;;   // Real:00:00:21.695,CPU:00:00:24.437

Looking at the Performance tab of the Task Manager application, one can see that the four cores are never all in use.

Is there a way to get all four cores involved in this computation?

3 Answers:

Answer 0 (score: 5):

I think you are overcomplicating the problem.

The primary use case of async workflows is not CPU-bound work but IO-bound work, to avoid blocking threads while waiting for results that arrive with some latency.

While you can use async to parallelize CPU-bound work, doing so is suboptimal.
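
For contrast, here is a minimal IO-style sketch of the kind of workload async is designed for (the fetchOne name and the 1000 ms delay are hypothetical stand-ins for a real network or disk call); the thread is returned to the pool while each operation waits, so many of them can be in flight at once:

let fetchOne (i : int) = async {
    // stand-in for a network call; the thread is released while waiting
    do! Async.Sleep 1000
    return i * 2 }

let fetched =
    [1..100]
    |> List.map fetchOne
    |> Async.Parallel
    |> Async.RunSynchronously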

You can achieve your goal much more easily by using the Array.Parallel module on Arrays instead of Lists.

let g i = 
    [|1..1000000|] 
    |> Array.Parallel.map (fun x -> sqrt (float (1000 * x + 1))) 
    |> Array.head

There is no need to write your own chunking and joining code; it is all handled for you, and by my measurements it is much faster.
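
If you want to keep the parallelism at the outer level instead (mapping g over the inputs, as in the question), a minimal sketch of the same idea is to convert the list to an array and let Array.Parallel.map do the partitioning for you (the helper name parMap is my own):

let parMap (f : 'T -> 'S) (xs : list<'T>) =
    xs
    |> List.toArray
    |> Array.Parallel.map f
    |> Array.toList

// usage: parMap g [1..100]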

Answer 1 (score: 3):

In F#, async workflows run on the .NET ThreadPool class, which has GetMinThreads and GetMaxThreads methods. They use two out parameters to return the minimum or maximum number of threads the thread pool is allowed to use, but in F# that gets converted into a function returning a tuple:

F# Interactive for F# 4.1
Freely distributed under the Apache 2.0 Open Source License

For help type #help;;

> open System.Threading ;;
> ThreadPool.GetMinThreads() ;;
val it : int * int = (4, 4)

> ThreadPool.GetMaxThreads() ;;
val it : int * int = (400, 200)

The two numbers are for "worker" threads and "asynchronous I/O" threads, respectively. My CPU has four cores, so the minimum number of both kinds of threads in the pool is 4. I am not sure this is your problem, but try running ThreadPool.GetMinThreads() on your system and make sure it is 4. If for some reason it is 2, that could explain why you are not getting better performance.
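
If GetMinThreads does report a number lower than your core count, here is a small sketch of raising it (only a guess at a fix; SetMinThreads returns a bool indicating whether the change was accepted):

open System
open System.Threading

let cores = Environment.ProcessorCount
// ask the pool to keep at least one worker and one IO thread per core ready
let ok = ThreadPool.SetMinThreads (cores, cores)
printfn "SetMinThreads succeeded: %b" ok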

Also see this answer (https://stackoverflow.com/a/26041852/2314532) for another performance issue that can arise when using async workflows for parallel processing. That may also be what is happening here.

Finally, there is one more thing I would like to mention. As things stand, I am actually surprised you got any benefit from parallelism at all, because splitting the list and concatenating it again has a cost. Since the F# list type is a singly linked list, that cost is O(N), and those steps (splitting and recombining) cannot themselves be parallelized.

The answer to that problem is to use a different data structure, such as an RRB Tree, for any list of items you plan to process in parallel: it is designed for efficient splitting and joining (effectively O(1) splits and joins, although the constant factor on joins is fairly large). Unfortunately, there is currently no implementation of RRB trees in F#. I am working on one now and estimate it may be ready in another month or so. If you want to know when I have released the code I have been working on, you can subscribe to this GitHub issue.

Answer 2 (score: 2):

The answers here are already good, but I will add some comments on performance and parallelism.

For performance in general, we want to avoid dynamic allocation, because we do not want to waste precious cycles allocating objects (fast in .NET, slow in C/C++) or collecting them (quite slow).

We also want to minimize the memory footprint of objects and make sure they are laid out sequentially in memory (arrays are our friends here), so that the CPU cache and prefetcher are used as efficiently as possible. A cache miss can cost a few hundred cycles.

I also think it is important to always compare against a simple, sequential but efficiently implemented loop as a sanity check on the parallel performance. Otherwise we might fool ourselves into thinking our parallel masterpiece is doing well when in reality it is outperformed by a plain loop.

In addition, vary the size of the input data, both because of caching effects and because starting a parallel computation incurs overhead.

With that said, I have prepared different versions of the following code:

module SequentialFold =
  let compute (vs : float []) : float =
    vs |> Array.fold (fun s v -> s + sqrt (1000. * v + 1.)) 0. 

I then compared the performance of the different versions in order to see which size does best in terms of performance and GC pressure.

The performance tests are set up so that the total amount of work is always the same regardless of the input size, in order to make the timings comparable.

Here is the code:

open System
open System.Threading.Tasks

let clock =
  let sw = System.Diagnostics.Stopwatch ()
  sw.Start ()
  fun () -> sw.ElapsedMilliseconds

let timeIt n a = 
  let r                 = a ()  // Warm-up

  GC.Collect (2, GCCollectionMode.Forced, true)

  let inline cc g       = GC.CollectionCount g
  let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
  let before            = clock ()

  for i = 1 to n do
    a () |> ignore

  let after             = clock ()
  let acc0, acc1, acc2  = cc 0, cc 1, cc 2

  after - before, acc0 - bcc0, acc1 - bcc1, acc2 - bcc2, r

// compute implemented using tail recursion
module TailRecursion =
  let compute (vs : float []) : float =
    let rec loop s i =
      if i < vs.Length then
        let v = vs.[i]
        loop (s + sqrt (1000. * v + 1.)) (i + 1)
      else
        s
    loop 0. 0

// compute implemented using Array.fold
module SequentialFold =
  let compute (vs : float []) : float =
    vs |> Array.fold (fun s v -> s + sqrt (1000. * v + 1.)) 0. 

// compute implemented using Array.map + Array.fold
module SequentialArray =
  let compute (vs : float []) : float =
    vs |> Array.map (fun v -> sqrt (1000. * v + 1.)) |> Array.fold (+) 0. 

// compute implemented using Array.Parallel.map + Array.fold
module ParallelArray =
  let compute (vs : float []) : float =
    vs |> Array.Parallel.map (fun v -> sqrt (1000. * v + 1.)) |> Array.fold (+) 0. 

// compute implemented using Parallel.For
module ParallelFor =
  let compute (vs : float []) : float =
    let lockObj         = obj ()
    let mutable sum     = 0.
    let options         = ParallelOptions()
    let init ()         = 0.
    let body i pls s    =
      let v = i |> float
      s + sqrt (1000. * v + 1.)
    let localFinally ls =
      lock lockObj <| fun () -> sum <- sum + ls
    let pls = Parallel.For  (                                             0
                            ,                                             vs.Length
                            ,                                             options
                            , Func<float>                                 init          
                            , Func<int, ParallelLoopState, float, float>  body          
                            , Action<float>                               localFinally  
                            )
    sum

// compute implemented using Parallel.For with batches of size 100
module ParallelForBatched =
  let compute (vs : float []) : float =
    let inner           = 100
    let outer           = vs.Length / inner + (if vs.Length % inner = 0 then 0 else 1)
    let lockObj         = obj ()
    let mutable sum     = 0.
    let options         = ParallelOptions()
    let init ()         = 0.
    let rec loop e s i  =
      if i < e then
        let v = vs.[i]
        loop e (s + sqrt (1000. * v + 1.)) (i + 1)
      else
        s
    let body i pls s    =
      let b = i * inner
      let e = b + inner |> min vs.Length
      loop e s b
    let localFinally ls =
      lock lockObj <| fun () -> sum <- sum + ls
    let pls = Parallel.For  (                                             0
                            ,                                             outer
                            ,                                             options
                            , Func<float>                                 init          
                            , Func<int, ParallelLoopState, float, float>  body          
                            , Action<float>                               localFinally  
                            )
    sum

[<EntryPoint>]
let main argv =
  let count   = 100000000
  let outers  =
    [|
      //10000000
      100000
      1000
      10
    |]

  for outer in outers do
    let inner     = count / outer
    let vs        = Array.init inner float
    let testCases = 
      [|
        "TailRecursion"         , fun ()  -> TailRecursion.compute    vs 
        "Fold.Sequential"       , fun ()  -> SequentialFold.compute   vs
        "Array.Sequential"      , fun ()  -> SequentialArray.compute  vs
        "Array.Parallel"    , fun ()  -> ParallelArray.compute    vs
        "Parallel.For"          , fun ()  -> ParallelFor.compute      vs
        "Parallel.For.Batched"  , fun ()  -> ParallelForBatched.compute      vs
      |]
    printfn "Using outer = %A, inner = %A, total is: %A" outer inner count
    for nm, a in testCases do
      printfn "  Running test case: %A" nm
      let tm, cc0, cc1, cc2, r = timeIt outer a
      printfn "   it took %A ms with GC collects (%A, %A, %A), result is: %A" tm cc0 cc1 cc2 r
  0

Here are the results (Intel i5, 4 cores):

Using outer = 100000, inner = 1000, total is: 100000000
  Running test case: "TailRecursion"
   it took 389L ms with GC collects (0, 0, 0), result is: 666162.111
  Running test case: "Fold.Sequential"
   it took 388L ms with GC collects (0, 0, 0), result is: 666162.111
  Running test case: "Array.Sequential"
   it took 628L ms with GC collects (255, 0, 0), result is: 666162.111
  Running test case: "Array.Parallel"
   it took 993L ms with GC collects (306, 2, 0), result is: 666162.111
  Running test case: "Parallel.For"
   it took 711L ms with GC collects (54, 2, 0), result is: 666162.111
  Running test case: "Parallel.For.Batched"
   it took 490L ms with GC collects (52, 2, 0), result is: 666162.111
Using outer = 1000, inner = 100000, total is: 100000000
  Running test case: "TailRecursion"
   it took 389L ms with GC collects (0, 0, 0), result is: 666661671.1
  Running test case: "Fold.Sequential"
   it took 388L ms with GC collects (0, 0, 0), result is: 666661671.1
  Running test case: "Array.Sequential"
   it took 738L ms with GC collects (249, 249, 249), result is: 666661671.1
  Running test case: "Array.Parallel"
   it took 565L ms with GC collects (249, 249, 249), result is: 666661671.1
  Running test case: "Parallel.For"
   it took 157L ms with GC collects (0, 0, 0), result is: 666661671.1
  Running test case: "Parallel.For.Batched"
   it took 110L ms with GC collects (0, 0, 0), result is: 666661671.1
Using outer = 10, inner = 10000000, total is: 100000000
  Running test case: "TailRecursion"
   it took 387L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
  Running test case: "Fold.Sequential"
   it took 390L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
  Running test case: "Array.Sequential"
   it took 811L ms with GC collects (3, 3, 3), result is: 6.666666168e+11
  Running test case: "Array.Parallel"
   it took 567L ms with GC collects (4, 4, 4), result is: 6.666666168e+11
  Running test case: "Parallel.For"
   it took 151L ms with GC collects (0, 0, 0), result is: 6.666666168e+11
  Running test case: "Parallel.For.Batched"
   it took 102L ms with GC collects (0, 0, 0), result is: 6.666666168e+11

TailRecursion and Fold.Sequential perform similarly.

Array.Sequential does worse because the work is split into two operations, map and fold. In addition, we get GC pressure because it allocates an extra array.

Array.Parallel is the same as Array.Sequential but uses Array.Parallel.map instead of Array.map. Here we see the overhead of starting many small parallel computations: a small input size means more parallel computations, which means more overhead. Moreover, the performance is poor even though we are using several cores, because the computation per element is very small and the overhead of managing the distribution eats up any benefit of spreading the work over multiple cores. When comparing the single-threaded performance of 390 ms with the parallel performance of 990 ms, one might be surprised that it is only 3x worse, but in reality it is 12x worse, because all four cores were used to produce an answer that is 3x slower.

Parallel.For does better because it allows the parallel computation to proceed without allocating a new array, and its internal overhead is probably lower. Here we manage to gain performance at the larger sizes, but we still fall behind the sequential algorithms at the smaller sizes because of the overhead of starting the parallel computation.

Parallel.For.Batched tries to reduce the overhead by increasing the cost of each individual computation: each parallel computation folds over several array values. It is essentially a combination of the TailRecursion algorithm and Parallel.For. Thanks to this we reach an efficiency of 95% at the larger sizes, which can be considered decent.

For a simple computation like this, one could also use AVX, for a potential speedup of around 16x, at the cost of the code becoming somewhat hairier.
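
As a rough illustration only (not part of the measurements above): .NET exposes SIMD through System.Numerics.Vector<'T> without dropping to raw intrinsics. Note that with F#'s float (double) an AVX register holds just 4 lanes, and real gains depend on Vector.IsHardwareAccelerated being true on your runtime, so the ~16x figure would additionally require single precision and more tuning. A sketch in the style of the compute functions above:

open System.Numerics

module SimdFold =
  let compute (vs : float []) : float =
    let width       = Vector<float>.Count            // 4 doubles per 256-bit register
    let thousand    = Vector<float>(1000.)
    let one         = Vector<float>(1.)
    let mutable acc = Vector<float>.Zero
    let mutable i   = 0
    while i + width <= vs.Length do
      let v = Vector<float>(vs, i)                   // load a block of the array
      acc <- acc + Vector.SquareRoot (v * thousand + one)
      i <- i + width
    let mutable sum = 0.
    for j = 0 to width - 1 do                        // horizontal sum of the lanes
      sum <- sum + acc.[j]
    for j = i to vs.Length - 1 do                    // scalar tail for leftover elements
      sum <- sum + sqrt (1000. * vs.[j] + 1.)
    sum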

With batched parallelism we reached 95% of the expected performance speedup.

The takeaway is that it is important to continually measure the performance of your parallel algorithms and compare them against simple sequential implementations.