Slowdown when using GHC parallel strategies

Time: 2015-02-22 16:11:26

Tags: multithreading performance haskell parallel-processing ghc

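To learn about GHC's parallel strategies, I wrote a simple particle simulator that, given a particle's position, velocity, and acceleration, projects the particle's path forward.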

import Control.Parallel.Strategies

-- Use phantom a to store axis.
newtype Pos a = Pos Double deriving Show
newtype Vel a = Vel Double deriving Show
newtype Acc a = Acc Double deriving Show
newtype TimeStep = TimeStep Double deriving Show

-- Phantom axis
data X
data Y

-- Position, velocity, acceleration for a particle.
data Particle = Particle (Pos X) (Pos Y) (Vel X) (Vel Y) (Acc X) (Acc Y) deriving (Show)

stepParticle :: TimeStep -> Particle -> Particle
stepParticle ts (Particle x y xv yv xa ya) =
  Particle x' y' xv' yv' xa' ya'
  where
    (x', xv', xa') = step ts x xv xa
    (y', yv', ya') = step ts y yv ya

-- Given a position, velocity, and accel, calculate the pos, vel, acc after
-- a given TimeStep.
step :: TimeStep -> Pos a -> Vel a -> Acc a -> (Pos a, Vel a, Acc a)
step (TimeStep ts) (Pos p) (Vel v) (Acc a) = (Pos p', Vel v', Acc a)
  where
    v' = ts * a + v
    p' = ts * v + p

-- Build a list of lazy infinite lists, one per particle, of each
-- particle's travel, with each update a TimeStep apart. Evaluate each
-- inner list in parallel.
simulateParticlesPar :: TimeStep -> [Particle] -> [[Particle]]
simulateParticlesPar ts = withStrategy (parList (parBuffer 250 particleStrategy))
                          . fmap (simulateParticle ts)

-- Build a lazy infinite list of the particle's travel with each
-- update being a TimeStep apart.
simulateParticle :: TimeStep -> Particle -> [Particle]
simulateParticle ts m = m' : simulateParticle ts m'
  where
    m' = stepParticle ts m

particleStrategy :: Strategy Particle
particleStrategy (Particle (Pos x) (Pos y) (Vel xv) (Vel yv) (Acc xa) (Acc ya)) = do
  x' <-  rseq x
  y' <-  rseq y
  xv' <- rseq xv
  yv' <- rseq yv
  xa' <- rseq xa
  ya' <- rseq ya
  return $ Particle (Pos x') (Pos y') (Vel xv') (Vel yv') (Acc xa') (Acc ya')

main :: IO ()
main = do
  let world = replicate 100 (Particle (Pos 0) (Pos 0) (Vel 1) (Vel 1) (Acc 0) (Acc 0))
      ts = TimeStep 0.1
  print $ fmap (take 10000) (simulateParticlesPar ts world)

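For each particle, I create a lazy infinite list of the particle's path projected into the future. I start with 100 of these particles and project them all forward; my intent is to project them forward in parallel (roughly one spark per infinite list). If I project these lists forward far enough, I would expect a significant speedup. Unfortunately, I see a slight slowdown.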
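The intent described above (roughly one spark per list) can be sketched as follows. This is a minimal sketch under assumptions, not the code from the question: it uses a simplified one-axis particle type, a finite prefix of n steps instead of an infinite list, and parList rdeepseq so that each spark fully evaluates one particle's whole prefix. The names P, stepP, trajectory and simulateCoarse are hypothetical.

```haskell
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (parList, rdeepseq, withStrategy)

-- Hypothetical one-axis stand-in for the question's Particle:
-- position, velocity, acceleration. Strict fields avoid thunk build-up.
data P = P !Double !Double !Double deriving (Eq, Show)

-- With strict Double fields, matching the constructor already gives normal form.
instance NFData P where
  rnf (P _ _ _) = ()

-- Same update rule as `step` in the question, for a single axis.
stepP :: Double -> P -> P
stepP ts (P p v a) = P (ts * v + p) (ts * a + v) a

-- Finite prefix (n steps) of one particle's trajectory, excluding the start state.
trajectory :: Double -> Int -> P -> [P]
trajectory ts n = take n . tail . iterate (stepP ts)

-- One spark per particle: each spark evaluates its whole prefix to normal form.
simulateCoarse :: Double -> Int -> [P] -> [[P]]
simulateCoarse ts n = withStrategy (parList rdeepseq) . map (trajectory ts n)

main :: IO ()
main = do
  let world = replicate 100 (P 0 1 0)
  -- Summing the lengths avoids spending most of the run time in `show`.
  print (sum (map length (simulateCoarse 0.1 10000 world)))
```

This assumes the parallel and deepseq packages are available and the same build flags as in the question (-threaded -rtsopts). Whether it actually runs faster with -N2 on a given machine would still have to be measured.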

Compiled with:

ghc phys.hs -rtsopts -threaded -eventlog -O2

With 1 thread:

$ ./phys +RTS -N1 -sstderr -ls > /dev/null
  24,264,983,224 bytes allocated in the heap
     441,881,088 bytes copied during GC
       1,942,848 bytes maximum residency (104 sample(s))
          75,880 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46820 colls,     0 par    0.82s    0.88s     0.0000s    0.0039s
  Gen  1       104 colls,     0 par    0.23s    0.23s     0.0022s    0.0037s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 1025000 (25 converted, 0 overflowed, 0 dud, 28680 GC'd, 996295 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    9.90s  ( 10.09s elapsed)
  GC      time    1.05s  (  1.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   10.95s  ( 11.20s elapsed)

  Alloc rate    2,451,939,648 bytes per MUT second

  Productivity  90.4% of total user, 88.4% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

My Intel i5 has 2 cores and 4 threads, and with -N4 it is 2x slower than with -N1 (total time ~20 seconds).

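I have spent quite a bit of time trying different strategies, such as chunking the outer list (so each spark gets multiple streams to project forward) and using rpar for each field in particleStrategy, but I have not gotten any speedup yet.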
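For reference, chunking the outer list can be expressed with parListChunk from Control.Parallel.Strategies. The sketch below is an illustration under assumptions, not the variant that was actually tried: the tuple type P, stepP, simulateChunked, the chunk size of 25 and the finite step count are all hypothetical choices.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- Hypothetical one-axis particle state: (position, velocity, acceleration).
type P = (Double, Double, Double)

-- Same update rule as `step` in the question, for a single axis.
stepP :: Double -> P -> P
stepP ts (p, v, a) = (ts * v + p, ts * a + v, a)

-- Group the outer list into chunks of `chunk` particles; each chunk becomes
-- one spark that fully evaluates the first n steps of all its particles.
simulateChunked :: Int -> Double -> Int -> [P] -> [[P]]
simulateChunked chunk ts n =
  withStrategy (parListChunk chunk rdeepseq)
    . map (take n . tail . iterate (stepP ts))

main :: IO ()
main = do
  let world = replicate 100 (0, 1, 0)
  -- 100 particles in chunks of 25 gives 4 sparks.
  print (sum (map length (simulateChunked 25 0.1 1000 world)))
```

The chunk size trades spark overhead against load balance, so a suitable value would need to be found by experiment.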

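Below is a zoomed-in portion of the event log under threadscope. As you can see, I get almost no concurrency. Most of the work is done by HEC0, with some activity from HEC1 interleaved, but only one HEC is working at a time. This is quite representative of all the strategies I have tried.

[Image: -N2 under threadscope]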

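As a sanity check, I also ran a few sample programs from "Parallel and Concurrent Programming in Haskell" and saw slowdowns with those programs too, even though I used the same arguments that gave significant speedups in the book! I am starting to think something is wrong with my ghc.

With 2 threads: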

$ ./phys +RTS -N2 -sstderr -ls > /dev/null
  24,314,635,280 bytes allocated in the heap
     457,603,240 bytes copied during GC
       1,962,152 bytes maximum residency (104 sample(s))
         119,824 bytes maximum slop
               7 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46555 colls, 46555 par    1.40s    0.85s     0.0000s    0.0048s
  Gen  1       104 colls,   103 par    0.42s    0.25s     0.0024s    0.0043s

  Parallel GC work balance: 16.85% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 1025000 (1023572 converted, 0 overflowed, 0 dud, 1367 GC'd, 61 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   11.07s  ( 11.20s elapsed)
  GC      time    1.82s  (  1.10s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   12.89s  ( 12.30s elapsed)

  Alloc rate    2,196,259,905 bytes per MUT second

  Productivity  85.9% of total user, 90.0% of total elapsed

gc_alloc_block_sync: 9222
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 2393
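Both runs report the same spark total, 1025000. The sketch below shows one way to account for that number. It rests on an assumption that is not verified against the library source here: that parBuffer n sparks a look-ahead window of n elements and then one more element for each element the consumer demands. The helper sparksPerList is hypothetical.

```haskell
-- Assumption (not verified here): parBuffer sparks `window` elements up front,
-- then one further element per element consumed from the list.
sparksPerList :: Int -> Int -> Int
sparksPerList window consumed = window + consumed

main :: IO ()
main =
  -- 100 particles, parBuffer 250, take 10000 per list (values from the question).
  print (100 * sparksPerList 250 10000)
```

The result matches the reported total. If the assumption holds, nearly every spark covers roughly one particle update, which is a very small unit of work; this is an interpretation of the statistics, not a measured result.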

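$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3

Installed from: https://ghcformacosx.github.io/

OS X 10.10.2

Update

I found an OS X threaded RTS performance regression in the ghc tracker: https://ghc.haskell.org/trac/ghc/ticket/7602. I am hesitant to blame the compiler, but my -N4 output supports this hypothesis: its "Parallel GC work balance" is terrible.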

On the other hand, I don't know whether this explains my threadscope output, which shows a lack of any concurrency.

0 answers:

No answers yet.