我使用Parallel Haskell编写了以下程序来找到10亿的除数。
import Control.Parallel
parfindDivisors :: Integer->[Integer]
parfindDivisors n = f1 `par` (f2 `par` (f1 ++ f2))
where f1=filter g [1..(quot n 4)]
f2=filter g [(quot n 4)+1..(quot n 2)]
g z = n `rem` z == 0
main = print (parfindDivisors 1000000000)
我用ghc -rtsopts -threaded findDivisors.hs
编译了程序,我运行它:
findDivisors.exe +RTS -s -N2 -RTS
与简单版本相比,我发现加速率提高了50%:
findDivisors :: Integer->[Integer]
findDivisors n = filter g [1..(quot n 2)]
where g z = n `rem` z == 0
我的处理器是英特尔的双核2双核处理器。
我想知道上面的代码是否有任何改进。因为在程序打印的统计中说:
Parallel GC work balance: 1.01 (16940708 / 16772868, ideal 2)
和SPARKS: 2 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
这些已转换,溢出, dud , GC'd ,失败是什么?怎样才能有助于缩短时间。
答案 0 :(得分:2)
IMO,Par
monad有助于推理并行性。它比处理par
和pseq
更高一级。
这是使用parfindDivisors
monad重写Par
。请注意,这与您的算法基本相同:
import Control.Monad.Par
findDivisors :: Integer -> [Integer]
findDivisors n = runPar $ do
[f0, f1] <- sequence [new, new]
fork $ put f0 (filter g [1..(quot n 4)])
fork $ put f1 (filter g [(quot n 4)+1..(quot n 2)])
[f0', f1'] <- sequence [get f0, get f1]
return $ f0' ++ f1'
where g z = n `rem` z == 0
使用-O2 -threaded -rtsopts -eventlog
进行编译并使用+RTS -N2 -s
运行会产生以下相关的运行时统计信息:
36,000,130,784 bytes allocated in the heap
3,165,440 bytes copied during GC
48,464 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 35162 colls, 35161 par 0.39s 0.32s 0.0000s 0.0006s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.32 (205296 / 155521, ideal 2)
MUT time 42.68s ( 21.48s elapsed)
GC time 0.39s ( 0.32s elapsed)
Total time 43.07s ( 21.80s elapsed)
Alloc rate 843,407,880 bytes per MUT second
Productivity 99.1% of total user, 195.8% of total elapsed
生产力非常高。为了略微改善GC工作平衡,我们可以增加GC分配区域的大小;以+RTS -N2 -s -A128M
运行,例如:
36,000,131,336 bytes allocated in the heap
47,088 bytes copied during GC
49,808 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 135 colls, 134 par 0.19s 0.10s 0.0007s 0.0009s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0010s 0.0010s
Parallel GC work balance: 1.62 (2918 / 1801, ideal 2)
MUT time 42.65s ( 21.49s elapsed)
GC time 0.20s ( 0.10s elapsed)
Total time 42.85s ( 21.59s elapsed)
Alloc rate 843,925,806 bytes per MUT second
Productivity 99.5% of total user, 197.5% of total elapsed
但这真的只是挑剔。真实的故事来自ThreadScope:
两个内核的利用率基本上是最大的,因此可能不会发生额外的重要并行化(针对两个内核)。
关于Par
monad的一些好注释是here。
<强>更新强>
使用Par
重写替代算法看起来像这样:
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a [a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0]
firstDivs <- get a
fork $ put b [n `quot` x | x <- firstDivs, x /= sqrtn]
secondDivs <- get b
return $ firstDivs ++ secondDivs
但你是对的,因为依赖于firstDivs
,这不会从并行性中获得任何好处。
通过让Strategies
参与并行地评估列表推导的元素,您仍然可以在此处合并并行性。类似的东西:
import Control.Monad.Par
import Control.Parallel.Strategies
findDivisors :: Integer -> [Integer]
findDivisors n = let sqrtn = floor (sqrt (fromInteger n)) in runPar $ do
[a, b] <- sequence [new, new]
fork $ put a
([a | (a, b) <- [quotRem n x | x <- [1..sqrtn]], b == 0] `using` parListChunk 2 rdeepseq)
firstDivs <- get a
fork $ put b
([n `quot` x | x <- firstDivs, x /= sqrtn] `using` parListChunk 2 rdeepseq)
secondDivs <- get b
return $ firstDivs ++ secondDivs
并运行此操作会提供一些统计信息,例如
3,388,800 bytes allocated in the heap
43,656 bytes copied during GC
68,032 bytes maximum residency (1 sample(s))
Tot time (elapsed) Avg pause Max pause
Gen 0 5 colls, 4 par 0.00s 0.00s 0.0000s 0.0001s
Gen 1 1 colls, 1 par 0.00s 0.00s 0.0002s 0.0002s
Parallel GC work balance: 1.22 (2800 / 2290, ideal 2)
MUT time (elapsed) GC time (elapsed)
Task 0 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 1 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 2 (bound) : 0.01s ( 0.01s) 0.00s ( 0.00s)
Task 3 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s)
SPARKS: 50 (49 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
MUT time 0.01s ( 0.00s elapsed)
GC time 0.00s ( 0.00s elapsed)
Total time 0.01s ( 0.01s elapsed)
Alloc rate 501,672,834 bytes per MUT second
Productivity 85.0% of total user, 95.2% of total elapsed
这里有近50个火花被转换 - 也就是说,正在进行有意义的并行工作 - 但是计算量还不足以观察并行性带来的任何挂钟增益。任何增益都可能被线程运行时中调度计算的开销所抵消。
答案 1 :(得分:1)