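I'm trying to experiment with parallel evaluation in Haskell, but I seem to have hit a wall.
As an experiment, I wanted to evaluate a list of tasks that each take a long time to complete, so I came up with this contrived example.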
import Control.Parallel.Strategies
startNum = 800000
bigList :: [Integer]
bigList = [2042^x | x <- [startNum..startNum+10]]
main = print $ sum $ parMap rdeepseq (length . show) bigList
I compiled this with
ghc -O2 -eventlog -rtsopts -threaded test.hs --make
and then ran it twice, once with -N1 and once with -N4.
$ time ./test +RTS -N1 -lf -sstderr
29128678
2,702,130,280 bytes allocated in the heap
59,409,320 bytes copied during GC
3,114,392 bytes maximum residency (68 sample(s))
1,093,600 bytes maximum slop
28 MB total memory in use (6 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3101 colls, 0 par 0.09s 0.08s 0.0000s 0.0005s
Gen 1 68 colls, 0 par 0.03s 0.03s 0.0004s 0.0009s
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 10.13s ( 10.13s elapsed)
GC time 0.11s ( 0.11s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 10.25s ( 10.25s elapsed)
Alloc rate 266,683,731 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real 0m10.250s
user 0m10.144s
sys 0m0.106s
$ time ./test +RTS -N4 -lf -sstderr
29128678
2,702,811,640 bytes allocated in the heap
712,017,768 bytes copied during GC
22,024,144 bytes maximum residency (67 sample(s))
6,134,968 bytes maximum slop
68 MB total memory in use (3 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1329 colls, 1329 par 2.77s 0.70s 0.0005s 0.0075s
Gen 1 67 colls, 66 par 0.11s 0.03s 0.0004s 0.0019s
Parallel GC work balance: 40.17% (serial 0%, perfect 100%)
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 51.56s ( 13.04s elapsed)
GC time 2.89s ( 0.73s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 54.45s ( 13.77s elapsed)
Alloc rate 52,423,243 bytes per MUT second
Productivity 94.7% of total user, 374.4% of total elapsed
gc_alloc_block_sync: 39520
whitehole_spin: 0
gen[0].sync: 3046
gen[1].sync: 4970
real 0m13.777s
user 0m44.362s
sys 0m10.093s
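I noticed that GC time went up a bit, but nothing I thought the extra cores couldn't make up for.
So I pulled up threadscope to take a look.
This is the result for -N1
This is the result for -N4
It looks like the sparks execute much faster in the -N1 case.
My question: why am I not seeing the speedup I expected from running a bunch of independent tasks in parallel?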
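For reference, the two reports above can be reduced to a couple of ratios. This is only a minimal sketch: the figures are copied from the -N1 and -N4 output, while the Run record and the function names are made up for illustration.

```haskell
-- Figures copied from the two "+RTS -sstderr" reports above.
-- The Run type and its field names are illustrative, not part of any library.
data Run = Run
  { mutCpu       :: Double  -- MUT time, CPU seconds
  , gcElapsed    :: Double  -- GC time, elapsed seconds
  , totalElapsed :: Double  -- Total time, elapsed seconds
  }

runN1, runN4 :: Run
runN1 = Run { mutCpu = 10.13, gcElapsed = 0.11, totalElapsed = 10.25 }
runN4 = Run { mutCpu = 51.56, gcElapsed = 0.73, totalElapsed = 13.77 }

-- Wall-clock speedup of the parallel run; values below 1 mean a slowdown.
speedup :: Run -> Run -> Double
speedup base parallel = totalElapsed base / totalElapsed parallel

-- How much more mutator CPU time the parallel run spent on the same work.
mutCpuRatio :: Run -> Run -> Double
mutCpuRatio base parallel = mutCpu parallel / mutCpu base

main :: IO ()
main = do
  putStrLn $ "speedup:       " ++ show (speedup runN1 runN4)
  putStrLn $ "MUT CPU ratio: " ++ show (mutCpuRatio runN1 runN4)
```

With these figures the wall-clock speedup is about 0.74, i.e. a slowdown, and the -N4 run spends roughly five times as much mutator CPU time as the -N1 run. GC elapsed time only grows from 0.11s to 0.73s, so most of the extra cost appears to sit in the mutator rather than in the collector.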
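Answer 0 (score: 5)
This seems to be related to the Integer operations. If you replace them with something else, you do see a speedup from parallel processing.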
This code shows no speedup with -N2:
import Control.Parallel (par, pseq)

main =
  let x = length . show $ 10^10000000
      y = length . show $ 10^10000001
  in x `par` y `pseq` print (x + y)
Neither does this:
import Control.Parallel (par, pseq)

main =
  let x = (10^10000000 :: Integer) `quotRem` 123
      y = (10^10000001 :: Integer) `quotRem` 123
  in x `par` y `pseq` print "ok"
But this code does get a parallel speedup:
import Control.Parallel (par, pseq)

main =
  let x = length $ replicate 1000000000 'x'
      y = length $ replicate 1000000001 'y'
  in x `par` y `pseq` print (x + y)
I couldn't find any locking in integer-gmp, though.
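One way to test this against the original parMap program is to keep its structure and swap the Integer work for something that only uses fixed-width Int arithmetic. The following is a rough sketch, not a benchmark: the work function, its constants, and the step count are all made up for illustration, and the step count would need tuning so that each task runs for about a second on your machine.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (foldl')

-- Hypothetical CPU-bound task that uses only fixed-width Int arithmetic.
-- `steps` controls how long each task runs; `seed` makes the tasks differ.
work :: Int -> Int -> Int
work steps seed = foldl' step seed [1 .. steps]
  where
    step acc i = (acc * 31 + i) `mod` 1000003

main :: IO ()
main = print $ sum $ parMap rdeepseq (work 200000000) [1 .. 11]
```

Compile it with the same flags as in the question and compare +RTS -N1 -s against +RTS -N4 -s. If the Integer operations are what prevents scaling, this version should show elapsed time dropping with more cores while MUT CPU time stays roughly constant.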