Haskell:不使用所有核心的并行程序

时间:2016-09-16 21:20:55

标签: multithreading haskell ghc

以下代码具有相同的性能,无论是使用-threaded编译还是不编译,或者以单线程方式编写代码。两个块(使用par和注释的forkIO/forkOS/forkOn)都会产生相同的性能。事实上,并行版本的性能略有下降(可能是由于并行GC的开销)。从像htop这样的程序中查看CPU利用率只显示一个CPU被挂钩,这非常令人困惑,因为我阅读代码的原因是它应该使用大部分内核。

forkOS不使用更多内核的事实特别令人困惑,因为ghc/rts/posix/OSThreads.c:forkOS_createThread的相关部分似乎暗示它会强制拨打pthread_create

-- (Apologies if I have missed an import or two)

import Data.List
import GHC.Conc
import Control.Concurrent
import Control.DeepSeq
import qualified Data.HashMap.Lazy as HM
main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
{-
  s <- numSparks
  putStrLn $ "Num sparks " <> show s
  n <- getNumCapabilities
  putStrLn $ "Num capabilities " <> show n
  m <- newEmptyMVar
  forkIO $ void $ forM [(1::Int)..one] $ \cpu -> do
    -- forkOn cpu $ void $ do
    forkOS $ void $ do
    -- forkIO $ void $ do
    -- void $ do
      putStrLn $ "core " <> show cpu
      s <- return $ sort $ HM.keys $ HM.fromList $ zip [cpu..two + cpu] (repeat (0::Int))
      putStrLn $ "core " <> show cpu <> " done " <> show (sum s)
      putMVar m ()
  forM [1..one] $ \i -> takeMVar m
  let s :: String = "hey!"
  putStrLn s
-}
  print one
  print two
  let __pmap__ f xs = case xs of
       [] -> []
       x:xs -> let y = f x
                   ys = __pmap__ f xs
                   in (y `par` ys) `pseq` (y: ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

我的.cabal文件中的相关部分

  ghc-options:
    -threaded
    -rtsopts
    "-with-rtsopts=-N15 -qg1"

平台信息

$ stack --version
Version 1.2.0, Git revision 241cd07d576d9c0c0e712e83d947e3dd64541c42 (4054 commits) x86_64 hpack-0.14.0
$ stack exec ghc -- --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.1 LTS
Release:    16.04
Codename:   xenial
$ uname -r
4.4.0-36-generic

为什么我的代码不能并行化?

编辑:如果它有用,添加-s运行时标志会产生以下报告

  21,829,377,776 bytes allocated in the heap
 126,512,021,712 bytes copied during GC
      86,659,312 bytes maximum residency (322 sample(s))
       6,958,976 bytes maximum slop
             218 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41944 colls,     0 par   16.268s  17.272s     0.0004s    0.0011s
  Gen  1       322 colls,   321 par   237.056s  23.822s     0.0740s    0.2514s

  Parallel GC work balance: 13.01% (serial 0%, perfect 100%)

  TASKS: 32 (1 bound, 31 peak workers (31 total), using -N15)

  SPARKS: 15 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 15 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time   12.504s  ( 13.301s elapsed)
  GC      time  253.324s  ( 41.094s elapsed)
  EXIT    time    0.000s  (  0.017s elapsed)
  Total   time  265.920s  ( 54.413s elapsed)

  Alloc rate    1,745,791,568 bytes per MUT second

  Productivity   4.7% of total user, 23.1% of total elapsed

gc_alloc_block_sync: 10725286
whitehole_spin: 0
gen[0].sync: 2171
gen[1].sync: 1057315

EDIT2:与竞技场大小的混乱似乎有很大帮助。我将-H2G -A1G添加到RTS选项中,时间从43秒降至5.2秒。还有什么可以改善的情况,以获得完整的15倍加速?

EDIT3:编辑代码以反映两位提供反馈的人建议的parpseq模式

1 个答案:

答案 0 :(得分:1)

问题是由__pmap__的定义引起的。具体而言,以下表达式存在问题:

let y = f x
 in y `par` (y: __pmap__ f xs)

您可能希望这会导致yy: __pmap__ f xs并行评估,但事实并非如此。会发生什么是GHC尝试并行评估它们,但第二个子表达式包含y,这是第一个子表达式。因此,第二个子表达式取决于第一个子表达式,因此无法并行评估它们。编写上述表达式的正确方法是

let y = f x
    ys = __pmap__ f xs
 in y `par` (ys `pseq` (y : ys))

因为pseq会强制ysy : ys之前进行评估,因此可以在y的评估正在运行时开始评估第二个子表达式。有关此问题的讨论,请参阅此thread

所以把它们放在一起,我们得到以下结果:

main :: IO ()
main = do
  let [one::Int, two] = [15, 1000000]
  print one
  print two
  let __pmap__ f xs = case xs of
        [] -> []
        x:xs -> let y = f x
                    ys = __pmap__ f xs
                 in y `par` ys `pseq` (y : ys)
  n <- pure $ sum . concat $ flip __pmap__ [1..one] $ \i ->
    traceShow i $ force $ sort $ HM.keys $ HM.fromList $ zip [i..(two + i)] (repeat (0::Int))
  putStrLn $ "sum " <> show n
  s <- numSparks
  putStrLn $ "Num sparks " <> show s

请注意,我已添加traceShow(来自Debug.Trace)。如果您在-N1中使用rtsopts运行此操作,您会看到列表将一次评估一个元素,而如果您使用-N3,则会一次评估3个元素

故事的寓意是parpseq容易被滥用,因此您应该更喜欢更高级别的解决方案,例如parMap rdeepseq(相当于您的__pmap__来自parallel