Parallel Haskell to find the divisors of a huge number

Date: 2012-09-18 09:36:55

Tags: haskell parallel-processing

I wrote the following program using parallel Haskell to find the divisors of 1,000,000,000.

import Control.Parallel

parfindDivisors :: Integer -> [Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
  where
    f1 = filter g [1..(quot n 4)]
    f2 = filter g [(quot n 4)+1..(quot n 2)]
    g z = n `rem` z == 0

main = print (parfindDivisors 1000000000)

I compiled the program with ghc -rtsopts -threaded findDivisors.hs and ran it with: findDivisors.exe +RTS -s -N2 -RTS

Compared with the simple version, I found about a 50% speedup:

findDivisors :: Integer -> [Integer]
findDivisors n = filter g [1..(quot n 2)]
  where
    g z = n `rem` z == 0

My processor is an Intel Core 2 Duo. I would like to know whether the code above can be improved in any way, because the statistics the program prints include:

Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)

SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

What are these converted, overflowed, dud, GC'd and fizzled sparks, and how can they help reduce the running time?
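
For reference, the canonical pairing in the Control.Parallel documentation combines par with pseq, so that the current thread works on one half while the spark evaluates the other. A sketch of that variant, untested here, with the primed name chosen only for this example:

import Control.Parallel (par, pseq)

-- Spark f1 for another core, let the current thread evaluate f2 first
-- (pseq forces its first argument to WHNF before returning the second),
-- then concatenate.  Because the lists are lazy, WHNF only reaches the
-- first element, so this alone does not guarantee the spark is converted.
parfindDivisors' :: Integer -> [Integer]
parfindDivisors' n = f1 `par` (f2 `pseq` (f1 ++ f2))
  where
    f1 = filter g [1..(quot n 4)]
    f2 = filter g [(quot n 4)+1..(quot n 2)]
    g z = n `rem` z == 0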

2 Answers:

Answer 0 (score: 2):

IMO, the Par monad helps you reason about parallelism. It is one level above dealing with par and pseq.

Here is parfindDivisors rewritten using the Par monad. Note that it is essentially the same as your algorithm:

import Control.Monad.Par

findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
    [f0, f1] <- sequence [new, new]                      -- two empty IVars
    fork $ put f0 (filter g [1..(quot n 4)])             -- fill the first half in parallel
    fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])  -- fill the second half in parallel
    [f0', f1'] <- sequence [get f0, get f1]              -- block until both are available
    return $ f0' ++ f1'
  where g z = n `rem` z == 0

Compiling with -O2 -threaded -rtsopts -eventlog and running with +RTS -N2 -s produces the following relevant runtime statistics:

  36,000,130,784 bytes allocated in the heap
       3,165,440 bytes copied during GC
          48,464 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     35162 colls, 35161 par    0.39s    0.32s     0.0000s    0.0006s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)

  MUT     time   42.68s  ( 21.48s elapsed)
  GC      time    0.39s  (  0.32s elapsed)
  Total   time   43.07s  ( 21.80s elapsed)

  Alloc rate    843,407,880 bytes per MUT second

  Productivity  99.1% of total user, 195.8% of total elapsed

Productivity is very high. To improve the GC work balance a little, we can increase the size of the GC allocation area; running with +RTS -N2 -s -A128M, for example:

  36,000,131,336 bytes allocated in the heap
          47,088 bytes copied during GC
          49,808 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       135 colls,   134 par    0.19s    0.10s     0.0007s    0.0009s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0010s    0.0010s

  Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)

  MUT     time   42.65s  ( 21.49s elapsed)
  GC      time    0.20s  (  0.10s elapsed)
  Total   time   42.85s  ( 21.59s elapsed)

  Alloc rate    843,925,806 bytes per MUT second

  Productivity  99.5% of total user, 197.5% of total elapsed

But that is really just nitpicking. The real story comes from ThreadScope:

[ThreadScope screenshot: lots of utilisation]

Utilisation of both cores is essentially maxed out, so significant additional parallelisation (for two cores) probably isn't going to happen.

There are some good notes on the Par monad here.
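
Incidentally, the monad-par library also exposes spawnP, which bundles the new/fork/put steps for a pure value. A minimal sketch of the same two-way split using it (the primed name is just for this example):

import Control.Monad.Par

-- spawnP sparks a pure computation and returns the IVar that will hold
-- its fully evaluated result, so the explicit new/fork/put plumbing
-- collapses into two calls.
findDivisors' :: Integer -> [Integer]
findDivisors' n = runPar $ do
    i0 <- spawnP (filter g [1..(quot n 4)])
    i1 <- spawnP (filter g [(quot n 4)+1..(quot n 2)])
    d0 <- get i0
    d1 <- get i1
    return (d0 ++ d1)
  where
    g z = n `rem` z == 0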

Update:

A rewrite of the alternative algorithm using Par looks like this:

findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    [a, b] <- sequence [new, new]
    fork $ put a [q | (q, r) <- [quotRem n x | x <- [1..sqrtn]], r == 0]
    firstDivs  <- get a
    fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

But you are right that, because of the dependency on firstDivs, this gets no benefit from parallelism.

You can still incorporate parallelism here by using Strategies to evaluate the elements of the list comprehensions in parallel. Something like:

import Control.Monad.Par
import Control.Parallel.Strategies

findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
    [a, b] <- sequence [new, new]
    fork $ put a
        ([q | (q, r) <- [quotRem n x | x <- [1..sqrtn]], r == 0] `using` parListChunk 2 rdeepseq)
    firstDivs  <- get a
    fork $ put b
        ([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
    secondDivs <- get b
    return $ firstDivs ++ secondDivs

Running this gives statistics like:

       3,388,800 bytes allocated in the heap
          43,656 bytes copied during GC
          68,032 bytes maximum residency (1 sample(s))

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         5 colls,     4 par    0.00s    0.00s     0.0000s    0.0001s
  Gen  1         1 colls,     1 par    0.00s    0.00s     0.0002s    0.0002s

  Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)

  SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)

  MUT     time    0.01s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  Total   time    0.01s  (  0.01s elapsed)

  Alloc rate    501,672,834 bytes per MUT second

  Productivity  85.0% of total user, 95.2% of total elapsed

Nearly 50 sparks are converted here, i.e. meaningful parallel work is being done, but the computation is not big enough to observe any wall-clock gain from the parallelism. Any gains are probably offset by the overhead of scheduling the computation in the threaded runtime.
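
Since the Par plumbing in that last version is mostly scaffolding around the two comprehensions, roughly the same parallel evaluation can also be written with Strategies alone. A sketch, where the function name and the coarser chunk size are arbitrary choices for illustration (coarser chunks give each spark more work over which to amortise its scheduling overhead):

import Control.Parallel.Strategies

-- Evaluate each comprehension with parListChunk directly, without the
-- Par monad wrapper.  The chunk size of 100 is arbitrary.
findDivisorsStrat :: Integer -> [Integer]
findDivisorsStrat n = firstDivs ++ secondDivs
  where
    sqrtn      = floor (sqrt (fromInteger n :: Double)) :: Integer
    firstDivs  = [q | (q, r) <- [quotRem n x | x <- [1..sqrtn]], r == 0]
                   `using` parListChunk 100 rdeepseq
    secondDivs = [n `quot` d | d <- firstDivs, d /= sqrtn]
                   `using` parListChunk 100 rdeepseq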

Answer 1 (score: 1):

I think this page explains it better than I could:

http://www.haskell.org/haskellwiki/ThreadScope_Tour/SparkOverview

I also found these slides interesting:

http://haskellwiki.gitit.net/Upload/HIW2011-Talk-Coutts.pdf