I am working on a computation that has as an intermediate result a list A = [B], which is a list of K lists B of length L. The time complexity of computing an element of B is governed by a parameter M and is theoretically linear in M. So, theoretically, I would expect computing A to take time O(K * L * M). This is not what I observe, however, and I do not understand why.
Here is a simple, complete sketch program that exhibits the problem I have described:
import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)

type Point = (Double, Double)

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))

standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map initial = initial : (trajectory map $ map initial)

justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ []     = []

subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m

ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
                 ps = randoms (mkStdGen 21)
             in take n $ zip qs ps

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) ->
                 Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m =
    parMap rdeepseq ((map observable . subTrace n m) . orbitGen)

main = let k = 100
           l = 100
           m = 100
           orbitGen = trajectory (standardMap 7)
           observable (p, q) = p^2 - q^2
           initials = ensemble k
           mean xs = (sum xs) / (fromIntegral $ length xs)
           result = (map mean)
                    $ transpose
                    $ ensembleTrace orbitGen observable l m
                    $ initials
       in mapM_ print result
I compile with
$ ghc -O2 stdmap.hs -threaded
and run with

$ ./stdmap +RTS -N4 > /dev/null

on an Intel Q6600 under Linux 3.6.3-1-ARCH, using GHC 7.6.1, and I get the following results for different parameter combinations K, L, M (k, l, m in the program code):
(K=200,L=200,M=200) -> real 0m0.774s
user 0m2.856s
sys 0m0.147s
(K=2000,L=200,M=200) -> real 0m7.409s
user 0m28.102s
sys 0m1.080s
(K=200,L=2000,M=200) -> real 0m7.326s
user 0m27.932s
sys 0m1.020s
(K=200,L=200,M=2000) -> real 0m10.581s
user 0m38.564s
sys 0m3.376s
(K=20000,L=200,M=200) -> real 4m22.156s
user 7m30.007s
sys 0m40.321s
(K=200,L=20000,M=200) -> real 1m16.222s
user 4m45.891s
sys 0m15.812s
(K=200,L=200,M=20000) -> real 8m15.060s
user 23m10.909s
sys 9m24.450s
I do not quite see where the problem with this supposedly clean scaling might lie. If I understand correctly, the lists are lazy and should never be built in full, since they are consumed head-to-tail? As can be seen from the measurements, there is a correlation between the excess real time and the excess system time, as if the surplus were being spent on the system's account. But even if some memory management is wasting the time, it should still scale linearly in K, L, M.

Help!
Edit:

I changed the code according to the suggestions made by Daniel Fisher, and that indeed solved the bad scaling with respect to M. As pointed out, forcing strict evaluation in the trajectory avoids the build-up of large thunks. I understand the performance improvement behind that, but I still do not understand the bad scaling of the original code, because (if I understand it correctly) the space-time complexity of the thunk construction should itself be linear in M?

In addition, I still cannot understand the bad scaling with respect to K (the size of the ensemble). I performed two additional measurements with the improved code for K = 8000 and K = 16000, keeping L = 200, M = 200. Up to K = 8000 the scaling is as expected, but for K = 16000 it is already anomalous. The problem seems to be in the number of overflowed SPARKS, which is 0 for K = 8000 and 7802 for K = 16000. This probably shows up as bad concurrency, which I quantify by the quotient Q = (MUT cpu time) / (MUT real time), ideally equal to the number of CPUs. However, Q ~ 4 for K = 8000 and Q ~ 2 for K = 16000.

Please help me understand the source of this problem and possible solutions.

K = 8000:
K = 8000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
56,905,405,184 bytes allocated in the heap
503,501,680 bytes copied during GC
53,781,168 bytes maximum residency (15 sample(s))
6,289,112 bytes maximum slop
151 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 27893 colls, 27893 par 7.85s 1.99s 0.0001s 0.0089s
Gen 1 15 colls, 14 par 1.20s 0.30s 0.0202s 0.0558s
Parallel GC work balance: 23.49% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 95.90s ( 24.28s elapsed)
GC time 9.04s ( 2.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 104.95s ( 26.58s elapsed)
Alloc rate 593,366,811 bytes per MUT second
Productivity 91.4% of total user, 360.9% of total elapsed
gc_alloc_block_sync: 315819
and
K = 16000:
$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null
113,809,786,848 bytes allocated in the heap
1,156,991,152 bytes copied during GC
114,778,896 bytes maximum residency (18 sample(s))
11,124,592 bytes maximum slop
300 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 135521 colls, 135521 par 22.83s 6.59s 0.0000s 0.0190s
Gen 1 18 colls, 17 par 2.72s 0.73s 0.0405s 0.1692s
Parallel GC work balance: 18.05% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 221.77s (139.78s elapsed)
GC time 25.56s ( 7.32s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 247.34s (147.10s elapsed)
Alloc rate 513,176,874 bytes per MUT second
Productivity 89.7% of total user, 150.8% of total elapsed
gc_alloc_block_sync: 814824
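(As a cross-check against the statistics above: reading Q off the MUT lines gives Q ≈ 95.90 / 24.28 ≈ 3.9 for K = 8000 and Q ≈ 221.77 / 139.78 ≈ 1.6 for K = 16000, consistent with the Q ~ 4 and Q ~ 2 quoted in the edit.)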
Answer 0 (score: 7):
M. A. D.'s point about fmod is a good one, but there is no need to call out to C; we can do better by staying in Haskell land (the ticket the linked thread was about has meanwhile been fixed). The trouble in
fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b in b * (q - fromIntegral (floor q))
is that type defaulting causes floor :: Double -> Integer (and consequently fromIntegral :: Integer -> Double) to be called. Now, Integer is a comparatively complicated type with slow operations, and the conversion from Integer to Double is also relatively complicated. The original code (with the parameters k = l = 200 and m = 5000) produced the statistics
./nstdmap +RTS -s -N2 > /dev/null
60,601,075,392 bytes allocated in the heap
36,832,004,184 bytes copied during GC
2,435,272 bytes maximum residency (13741 sample(s))
887,768 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 46734 colls, 46734 par 41.66s 20.87s 0.0004s 0.0058s
Gen 1 13741 colls, 13740 par 23.18s 11.62s 0.0008s 0.0041s
Parallel GC work balance: 60.58% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 34.99s ( 17.60s elapsed)
GC time 64.85s ( 32.49s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 99.84s ( 50.08s elapsed)
Alloc rate 1,732,048,869 bytes per MUT second
Productivity 35.0% of total user, 69.9% of total elapsed
on my machine (-N2 because I have only two physical cores). Simply changing the code to use a type signature, floor q :: Int, brings that down to
./nstdmap +RTS -s -N2 > /dev/null
52,105,495,488 bytes allocated in the heap
29,957,007,208 bytes copied during GC
2,440,568 bytes maximum residency (10481 sample(s))
893,224 bytes maximum slop
8 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 36979 colls, 36979 par 32.96s 16.51s 0.0004s 0.0066s
Gen 1 10481 colls, 10480 par 16.65s 8.34s 0.0008s 0.0018s
Parallel GC work balance: 68.64% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.01s ( 0.01s elapsed)
MUT time 29.78s ( 14.94s elapsed)
GC time 49.61s ( 24.85s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 79.40s ( 39.80s elapsed)
Alloc rate 1,749,864,775 bytes per MUT second
Productivity 37.5% of total user, 74.8% of total elapsed
a reduction of about 20% in elapsed time and 13% in MUT time. Not bad. If we look at the code for floor that you get with optimisations, we can see why:
floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n -> I# (n -# 1#)
        | otherwise           -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0# ->
          case negateInt# e of
            s | s ># 52#  -> if m < 0 then (-1) else 0
              | otherwise ->
                case TO64 m of
                  n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e
floor :: Double -> Int uses only machine conversions, while floor :: Double -> Integer needs the expensive decodeDoubleInteger and more branching. But where floor is called here, we know that all the Doubles involved are non-negative, so floor coincides with truncate, which maps directly to the machine conversion double2Int#. So let's try that instead of floor:
INIT time 0.00s ( 0.00s elapsed)
MUT time 29.29s ( 14.70s elapsed)
GC time 49.17s ( 24.62s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 78.45s ( 39.32s elapsed)

a very small reduction (to be expected, since fmod is not really the bottleneck). For comparison, calling out to C:
INIT time 0.01s ( 0.01s elapsed)
MUT time 31.46s ( 15.78s elapsed)
GC time 54.05s ( 27.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 85.52s ( 42.85s elapsed)

is a bit slower (not surprisingly; you can execute quite a few fmods in the time a call out to C takes).
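For reference, the call-out-to-C variant itself is not shown in the post; a minimal sketch of what such a binding could look like, assuming the standard C fmod from math.h plus a small wrapper to keep the result non-negative like the Haskell fmod above, would be:

{-# LANGUAGE ForeignFunctionInterface #-}

-- Illustrative sketch only; the actual C-call variant timed above is not shown in the post.
foreign import ccall unsafe "math.h fmod"
    c_fmod :: Double -> Double -> Double

-- C's fmod keeps the sign of the dividend, so shift negative results into [0, b).
fmodC :: Double -> Double -> Double
fmodC a b = let r = c_fmod a b
            in if r < 0 then r + b else r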
But that is not where the big fish swim. The really bad thing is that picking only every m-th element of the trajectories builds up big thunks, which cause a lot of allocation and take a long time to evaluate when their time finally comes. So let's eliminate that leak and make the trajectories strict:
{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point]
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)

That reduces the allocations and the GC time drastically, and as a consequence also the MUT time:
INIT time 0.00s ( 0.00s elapsed)
MUT time 21.83s ( 10.95s elapsed)
GC time 0.72s ( 0.36s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.55s ( 11.31s elapsed)

with the original fmod, and

INIT time 0.00s ( 0.00s elapsed)
MUT time 18.26s ( 9.18s elapsed)
GC time 0.58s ( 0.29s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 18.84s ( 9.47s elapsed)

with floor q :: Int; within the measurement precision, truncate q :: Int gives the same times (the allocation figures are a bit lower for truncate).
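To make the change concrete (the post describes it but does not show the modified function), the fmod with the monomorphic conversion could look like the following sketch; using truncate is the variant suggested above and is safe because q is non-negative at that point:

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise =
             if a < b
                then a
                -- q >= 0 here, so truncate agrees with floor; the Int
                -- annotation avoids defaulting to Integer
                else let q = a / b
                     in b * (q - fromIntegral (truncate q :: Int))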
"The problem seems to be in the number of overflowed SPARKS, which is 0 for K = 8000 and 7802 for K = 16000. This probably shows up as bad concurrency."

Yes (although, as far as I know, the more correct term here would be parallelism rather than concurrency). There is a spark pool, and when it is full, any further sparks are not scheduled to be evaluated by whichever thread next has time for them; the computation is then done without parallelism, in the parent thread. In this case that means that after an initial parallel phase the computation falls back to sequential.
The size of the spark pool is apparently about 8K (2^13).
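(This squares with the spark statistics above: of the 16000 sparks, 8198 were converted and the remaining 7802 overflowed, roughly what one would expect with a pool of about 2^13 = 8192 entries.)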
If you watch the CPU load via top, you will see that after a while it drops from (close to 100%) * (number of cores) to a much lower value (for me it was ~100% with -N2 and ~130% with -N4).
The cure is to avoid sparking too much, and to let each spark do more work. With the quick and dirty modification
ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)
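(withStrategy and parListChunk also live in Control.Parallel.Strategies, so the program's import list, which currently brings in only parMap and rdeepseq, has to be extended accordingly.)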
I am back to 200% with -N2 for practically the entire run, with good productivity,
INIT time 0.00s ( 0.00s elapsed)
MUT time 57.42s ( 29.02s elapsed)
GC time 5.34s ( 2.69s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 62.76s ( 31.71s elapsed)
Alloc rate 1,982,155,167 bytes per MUT second
Productivity 91.5% of total user, 181.1% of total elapsed
and with -N4 it is also fine (even a wee bit faster on the wall clock, though not by much, since all the threads do basically the same thing and I have only 2 physical cores),
INIT time 0.00s ( 0.00s elapsed)
MUT time 99.17s ( 26.31s elapsed)
GC time 16.18s ( 4.80s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 115.36s ( 31.12s elapsed)
Alloc rate 1,147,619,609 bytes per MUT second
Productivity 86.0% of total user, 318.7% of total elapsed
since now the spark pool does not overflow.
The proper fix is to make the size of the chunks a parameter that is computed from the number of trajectories and the number of available cores, so that the number of sparks does not exceed the pool size.
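A sketch of what that could look like, reusing Point and subTrace from the program above; the helper name, the use of getNumCapabilities, and the particular spark budget are illustrative choices, not taken from the post:

import Control.Concurrent (getNumCapabilities)
import Control.Parallel.Strategies (withStrategy, parListChunk, rdeepseq)
import Control.DeepSeq (NFData)

-- Hypothetical variant of ensembleTrace: the chunk size is derived from the
-- number of trajectories and the number of capabilities, keeping the number
-- of sparks well below the ~8K spark pool. IO is used only to query the
-- number of capabilities.
ensembleTraceChunked :: NFData a => (Point -> [Point]) -> (Point -> a)
                     -> Int -> Int -> [Point] -> IO [[a]]
ensembleTraceChunked orbitGen observable n m initials = do
    caps <- getNumCapabilities
    let k         = length initials
        -- a handful of sparks per capability for load balancing, capped
        -- far below the pool size (the numbers are illustrative)
        sparks    = min 1024 (16 * caps)
        chunkSize = max 1 (k `div` sparks)
    return $ withStrategy (parListChunk chunkSize rdeepseq)
           $ map ((map observable . subTrace n m) . orbitGen) initials

main would then bind this result in IO before the transpose/mean step, or one could pass the number of capabilities in as an ordinary argument and keep the function pure.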