Question

我试图在Haskell中尝试并行评估，但似乎已经碰壁了。

就像一个实验一样，我想评估一个需要很长时间才能完成的任务列表。所以我想出了这个人为的例子。

import Control.Parallel.Strategies

startNum = 800000

bigList :: [Integer]
bigList = [2042^x | x <- [startNum..startNum+10]]

main = print $ sum $ parMap rdeepseq (length . show) bigList

我用ghc -O2 -eventlog -rtsopts -threaded test.hs --make编译了这个，然后运行它两次。

$ time ./test +RTS -N1 -lf -sstderr
29128678
   2,702,130,280 bytes allocated in the heap
      59,409,320 bytes copied during GC
       3,114,392 bytes maximum residency (68 sample(s))
       1,093,600 bytes maximum slop
              28 MB total memory in use (6 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3101 colls,     0 par    0.09s    0.08s     0.0000s    0.0005s
  Gen  1        68 colls,     0 par    0.03s    0.03s     0.0004s    0.0009s
  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 11 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 11 fizzled)
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   10.13s  ( 10.13s elapsed)
  GC      time    0.11s  (  0.11s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   10.25s  ( 10.25s elapsed)
  Alloc rate    266,683,731 bytes per MUT second
  Productivity  98.9% of total user, 98.9% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
real    0m10.250s
user    0m10.144s
sys     0m0.106s

$ time ./test +RTS -N4 -lf -sstderr
29128678
   2,702,811,640 bytes allocated in the heap
     712,017,768 bytes copied during GC
      22,024,144 bytes maximum residency (67 sample(s))
       6,134,968 bytes maximum slop
              68 MB total memory in use (3 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1329 colls,  1329 par    2.77s    0.70s     0.0005s    0.0075s
  Gen  1        67 colls,    66 par    0.11s    0.03s     0.0004s    0.0019s
  Parallel GC work balance: 40.17% (serial 0%, perfect 100%)
  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
  SPARKS: 11 (11 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   51.56s  ( 13.04s elapsed)
  GC      time    2.89s  (  0.73s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   54.45s  ( 13.77s elapsed)
  Alloc rate    52,423,243 bytes per MUT second
  Productivity  94.7% of total user, 374.4% of total elapsed
gc_alloc_block_sync: 39520
whitehole_spin: 0
gen[0].sync: 3046
gen[1].sync: 4970
real    0m13.777s
user    0m44.362s
sys     0m10.093s

我注意到GC时间略有增加，但我没想到额外的内核无法满足。

所以我让threadscope看看。

这是-N1的结果 Run with -N1

这是-N4的结果 Run with -N4

似乎火花在-N1情况下能够更快地执行。

我的问题。为什么这并没有看到我希望并行执行大量独立任务的速度？

Answer 1

这似乎与Integer操作有关。如果用其他东西替换那些，那么你会看到并行处理的加速。

此代码无法加速-N2：

main = 
  let x = length . show $ 10^10000000
      y = length . show $ 10^10000001 in
  x `par` y `pseq` print (x + y)

这也不是

main = 
  let x = (10^10000000 :: Integer) `quotRem` 123
      y = (10^10000001 :: Integer) `quotRem` 123 in
  x `par` y `pseq` print "ok"

但是这段代码确实有并行加速：

main = 
  let x = length $ replicate 1000000000 'x'
      y = length $ replicate 1000000001 'y' in
  x `par` y `pseq` print (x + y)

我找不到integer-gmp中的任何锁定。

列表的并行评估

1 个答案: