我正在Haskell学习并行策略。
我根据梯形规则(顺序)编写了积分器函数:
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
= let lst = map f [ini,ini+dx..fin]
in sum lst * dx - 0.5 * (f ini + f fin) * dx
我按如下方式运行它:
main = do
print $ (integrateT (\x -> x^4 - x^3 + x^2 + x/13 + 1) (0.0,1000000.0) 0.01 :: Double)
以下是运行统计信息:
stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
18,400,147,552 bytes allocated in the heap
20,698,168 bytes copied during GC
66,688 bytes maximum residency (2 sample(s))
35,712 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17754 colls, 17754 par 0.123s 0.105s 0.0000s 0.0011s
Gen 1 2 colls, 1 par 0.000s 0.000s 0.0001s 0.0002s
Parallel GC work balance: 0.27% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 6.054s ( 5.947s elapsed)
GC time 0.123s ( 0.106s elapsed)
EXIT time 0.001s ( 0.008s elapsed)
Total time 6.178s ( 6.061s elapsed)
Alloc rate 3,039,470,269 bytes per MUT second
Productivity 98.0% of total user, 98.2% of total elapsed
gc_alloc_block_sync: 77
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
如您所见,它运行非常快。但是,当我尝试使其平行时,例如:
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
= let lst = (map f [ini,ini+dx..fin]) `using` parListChunk 100 rdeepseq
in sum lst * dx - 0.5 * (f ini + f fin) * dx
它总是运行更长的时间。这是上面运行的示例统计信息:
stack exec lab5 -- +RTS -ls -N2 -s
1.9999974872991426e29
59,103,320,488 bytes allocated in the heap
17,214,458,128 bytes copied during GC
2,787,092,160 bytes maximum residency (15 sample(s))
43,219,264 bytes maximum slop
5570 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 44504 colls, 44504 par 16.907s 10.804s 0.0002s 0.0014s
Gen 1 15 colls, 14 par 4.006s 2.991s 0.1994s 1.2954s
Parallel GC work balance: 33.60% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)
SPARKS: 1000001 (1000001 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.001s ( 0.001s elapsed)
MUT time 14.298s ( 12.392s elapsed)
GC time 20.912s ( 13.795s elapsed)
EXIT time 0.000s ( 0.003s elapsed)
Total time 35.211s ( 26.190s elapsed)
Alloc rate 4,133,806,996 bytes per MUT second
Productivity 40.6% of total user, 47.3% of total elapsed
gc_alloc_block_sync: 2304055
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 1105370
我在这里看到的几件事:
我做了一些实验:
答案 0 :(得分:3)
类似于parListChunk
的策略仅在到目前为止大部分计算成本用于评估各个列表元素时才有意义,而与首先构建列表书脊等的开销相比。但是,在简单多项式上使用integrateT
时,这些元素的计算非常便宜,并且绝大部分的开销都在列表开销中。它仍然可以在顺序版本中高效运行的唯一原因是,GHC可以内联/融合大部分业务,但显然不能在并行版本中进行。
解决方案:让每个线程运行适当可熔的顺序版本,即,划分 interval 而不是离散列表形式。喜欢
integrateT :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT f (ini, fin) dx
= let lst = map f [ini,ini+dx..fin]
in sum lst * dx - 0.5 * (f ini + f fin) * dx
integrateT_par :: (Fractional a, Enum a, NFData a) => (a -> a) -> (a,a) -> a -> a
integrateT_par f (l,r) dx
= let chunks = [ integrateT f (l + i*wChunk, l + (i+1)*wChunk) dx
| i<-[0..nChunks-1] ]
`using` parList rdeepseq
in sum chunks
where nChunks = 100
wChunk = (r-l)/nChunks
这将不会比顺序版本具有明显更差的内存或性能。
记住,正如我现在测试的那样,它实际上似乎根本不是并行运行的,但是可以肯定的是我只是错过了设置中的某些内容...