加速Haskell并发

时间:2012-09-18 10:41:03

标签: memory haskell concurrency lazy-evaluation

MVar,TVar,IORef,......我无法加速一个thunk问题(我认为)。

(我原来的问题是一个线程代码,我做“forkIO”n次调用“addMany”;但我认为我的问题是“shW”函数)

让下一个代码:

{-# LANGUAGE BangPatterns #-}
import Control.Concurrent
import Control.Monad
import System.Environment(getArgs)
import Data.Int
import Data.IORef

-- "i" times, add "n" for each IORef (in "a")
addMany :: [IORef Int64] -> Int64 -> Int64 -> IO ()
addMany !a !n !i =
  forM_ [1..i] (\_ ->
    forM_ a (shW n))

-- MVar, TVar, IORef, ... read/write (x' = x + k)
shR = readIORef
shW !k !r = atomicModifyIORef r (\ !x' -> (x' + k, ()))

main = do
  (n':i':_) <- getArgs
  let (n, i) = (read n', read i')
  v <- forM [1..n] (\_ -> newIORef 0)
  addMany v 1 i
  mapM shR v >>= (putStrLn.show.sum)

然后,个人资料结果显示:

MUT     time    3.12s  (  3.12s elapsed)
GC      time    6.96s  (  6.96s elapsed)
...

COST CENTRE  MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN         MAIN                     47           0    0.0    0.0   100.0  100.0
 main        Main                     95           0    0.0    0.0   100.0  100.0
  main.i     Main                    100           1    0.0    0.0     0.0    0.0
  addMany    Main                     99           1    0.4    0.5   100.0  100.0
   addMany.\ Main                    101       15000    6.6    0.0    99.6   99.5
    shW      Main                    102     2250000   92.7   99.5    93.0   99.5
     shW.\   Main                    104     2250000    0.3    0.0     0.3    0.0

我无法删除“shW”调用上的thunk(并且内存使用量很大)。怎么了?

类似的C#代码运行得更快(更快):

class Node { 
    private object m; 
    private int n; 

    public Node() {n = 0; m = new object();} 
    public void Inc() {lock(m) n++;} 
    public int Read() {return n;} 
} 

class MainClass { 

    public static void Main(string[] args) { 

        var nitems = 1000; 
        var nthreads = 6; 
        var niters = 100000; 

        var g = Enumerable.Range(0, nitems).Select(i => new Node()).ToArray(); 
        Task.WaitAll(Enumerable.Range(0, nthreads).Select(q => Task.Factory.StartNew(() => { 
            var z = niters; 
            while(z-- > 0) 
                foreach(var i in g) 
                    i.Inc(); 
        })).ToArray()); 

        Console.WriteLine("Sum node values: {0}", g.Sum(i => i.Read())); 

    } 
} 

非常感谢!

更新

解决原始问题:https://gist.github.com/3742897

非常感谢 Don Stewart

1 个答案:

答案 0 :(得分:27)

当您查看堆和GC时间时,很明显会发生泄漏:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
   1,208,298,840 bytes allocated in the heap
   1,352,861,868 bytes copied during GC
     280,181,684 bytes maximum residency (9 sample(s))
       4,688,680 bytes maximum slop
             545 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1677 colls,     0 par    2.27s    2.22s     0.0013s    0.0063s
  Gen  1         9 colls,     0 par    1.66s    1.77s     0.1969s    1.0273s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.70s  (  0.77s elapsed)
  GC      time    3.92s  (  4.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    4.62s  (  4.77s elapsed)

  %GC     time      84.8%  (83.8% elapsed)

  Alloc rate    1,718,469,461 bytes per MUT second

  Productivity  15.2% of total user, 14.7% of total elapsed

  real    0m4.752s
  user    0m0.015s
  sys     0m0.046s

280M驻留率和89%GC。很多人都被分配并扔掉了。

堆配置文件使这个显而易见。

enter image description here

线索是这些是“stg_app *”的东西(即STG机器应用thunk)。

modify 系列函数的一个微妙但非常显着的问题是这里的问题 - 当你有一个懒惰的atomicModify时,根本没有办法严格更新字段,没有要求价值。

因此,atomicModifyIORef r (\ !x' -> (x' + k, ()))所做的所有谨慎使用都是构建(+)函数的一系列应用程序,以便观察链的结果(即单元格中的值),每次添加它的论点是严格的。不是你想要的!您对参数对modifyIORef的严格注释都不会对单元本身产生任何影响。现在,通常一个懒惰的修改是你想要的 - 它只是一个指针交换,所以你可以有非常短的原子部分。

但有时那不是你想要的。

(有关此问题的背景请参阅GHC票证#5926,但至少在2007年我写了strict-concurrency包以避免MVars的这个问题时就已知道这个问题了。它是discussed in 2009,我们现在在2012年有严格的版本。)

首先要求该值,您可以删除该问题。例如。

shW !k !r = --atomicModifyIORef r (\x -> (x + k, ()))
    do x <- shR r
       writeIORef r $! (x+1)

请注意,此问题现在为documented in the libraries,您可以使用atomicModifyIORef'来避免此问题。

我们得到:

USDGHTVVH1$ time ./A 1000 10000 +RTS -s
10000000
     120,738,836 bytes allocated in the heap
       3,758,476 bytes copied during GC
          73,020 bytes maximum residency (1 sample(s))
          16,348 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       230 colls,     0 par    0.00s    0.01s     0.0000s    0.0001s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.17s  (  0.17s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.19s  (  0.17s elapsed)

  %GC     time       0.0%  (3.9% elapsed)

  Alloc rate    643,940,458 bytes per MUT second

  Productivity 100.0% of total user, 108.3% of total elapsed


real    0m0.218s
user    0m0.000s
sys     0m0.015s

也就是说,22倍的加速,内存使用变得不变。对于笑,这是新的堆配置文件:

enter image description here