我的问题是我的代码创建太多thunk (超过270MB),因此在按键分组值时,GC会花费太多时间(超过70%)。我想知道按键分组值的最佳方法是什么。
问题是我有矢量表示的键和值,我想按键保留顺序对值进行分组。例如:
输入:
keys = 1 2 4 3 1 3 4 2 1
vals = 1 2 3 4 5 6 7 8 9
输出:
1 = 1,5,9
2 = 2,8
3 = 4,6
4 = 3,7
编译选项:
ghc --make -03 -fllvm histogram.hs
在命令式编程中,我只使用多图,因此我决定使用哈希表,其中关联值为[Int]以存储分组值。我希望有更好的FP解决方案。
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip k v
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec =
V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec
ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec =
V.foldl' (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec
main :: IO ()
main = ts2 kv `seq` putStrLn "done"
这是在运行时吐出的东西:
3,117,102,992 bytes allocated in the heap
1,847,205,880 bytes copied during GC
324,159,752 bytes maximum residency (12 sample(s))
6,502,224 bytes maximum slop
658 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 5991 colls, 0 par 0.58s 0.58s 0.0001s 0.0003s
Gen 1 12 colls, 0 par 0.69s 0.69s 0.0577s 0.3070s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.45s ( 0.45s elapsed)
GC time 1.27s ( 1.27s elapsed)
EXIT time 0.03s ( 0.03s elapsed)
Total time 1.75s ( 1.75s elapsed)
%GC time 72.7% (72.8% elapsed)
Alloc rate 6,933,912,935 bytes per MUT second
Productivity 27.3% of total user, 27.3% of total elapsed
你可以看到它花了很多时间在GC上,所以我决定使用刘海来严格控制列表连接。我想 ++ 也非常昂贵,但不知道解决方法。
答案 0 :(得分:3)
那些严格的注释是无用的。他们只强制列表的第一个构造函数。
更糟糕的是,您似乎试图离开折叠(++)
,这绝不是一个好主意。它导致大量无用的中间列表复制,即使它完全严格。
您应该折叠为[Int] -> [Int]
值。这将摆脱多个无用的分配。我在移动设备上,所以我无法真正提供完整的示例代码。主要思想是将循环更改为M.insertWith (.) k (v:)
,然后在折叠后将($ [] )
映射到HashMap中的值。
答案 1 :(得分:2)
我尝试在我的主机上运行您的代码,但我无法重现您的个人资料:
runhaskell test8.hs +RTS -sstderr
done
120,112 bytes allocated in the heap
3,520 bytes copied during GC
68,968 bytes maximum residency (1 sample(s))
12,952 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.00s 0.09s 0.0909s 0.0909s
INIT time 0.00s ( 0.01s elapsed)
MUT time 0.00s ( 29.21s elapsed)
GC time 0.00s ( 0.09s elapsed)
EXIT time 0.00s ( 0.09s elapsed)
Total time 0.01s ( 29.40s elapsed)
%GC time 5.7% (0.3% elapsed)
Alloc rate 381,307,936 bytes per MUT second
Productivity 91.1% of total user, 0.0% of total elapsed
您能否详细介绍一下如何测试代码?如果你正在使用ghci那么
$ ghci -fobject-code
我们可能需要使用-fobject-code来消除ghci中的任何空间泄漏。如果您已经尝试过ghci选项,假设您正在使用ghci,我将编辑我的答案。此时,我想重现您所看到的问题。
更新: @ duplode:谢谢你的指点。我将删除之前的输出,没有人反对它,因为它有误导性。
我已经能够使用以下选项之一稍微减少gc开销。我得到了一些好处,但开销仍然在49-50%的范围内:
ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
V.foldl (\ht (!k, !v) ->
let
element = M.lookup k ht in
case element of
Nothing -> M.insert k [v] ht
Just aList -> M.insert k (v:aList) ht) M.empty vec
ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
in
V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec
调整看起来好一点,但结果看起来与直接查找类似。用ts4使用adjust:
calling ts4 done.
3,838,059,320 bytes allocated in the heap
2,041,603,344 bytes copied during GC
377,412,728 bytes maximum residency (6 sample(s))
7,725,944 bytes maximum slop
737 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 7260 colls, 0 par 1.32s 1.45s 0.0002s 0.0013s
Gen 1 6 colls, 0 par 0.88s 1.40s 0.2328s 0.9236s
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.18s ( 2.21s elapsed)
GC time 2.19s ( 2.85s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.07s elapsed)
Total time 4.38s ( 5.13s elapsed)
%GC time 50.0% (55.5% elapsed)
Alloc rate 1,757,267,879 bytes per MUT second
Productivity 50.0% of total user, 42.7% of total elapsed
使用简单的查找/更新(更新地图的命令式)
calling ts3 done.
3,677,137,816 bytes allocated in the heap
2,040,053,712 bytes copied during GC
395,867,512 bytes maximum residency (6 sample(s))
7,326,104 bytes maximum slop
769 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6999 colls, 0 par 1.35s 1.51s 0.0002s 0.0037s
Gen 1 6 colls, 0 par 1.06s 2.16s 0.3601s 1.3175s
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.89s ( 2.07s elapsed)
GC time 2.41s ( 3.67s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.01s ( 0.08s elapsed)
Total time 4.31s ( 5.82s elapsed)
%GC time 55.9% (63.0% elapsed)
Alloc rate 1,942,816,558 bytes per MUT second
Productivity 44.1% of total user, 32.6% of total elapsed
我有兴趣了解如何减少查询时间,如下面的配置文件输出所示:
COST CENTRE MODULE %time %alloc
ts3.\ Main 54.1 91.4
ts3.\.element Main 19.0 2.9
ts3 Main 11.0 2.9
kv.k Main 6.5 1.4
kv.v Main 5.2 1.4
kv.k.\ Main 4.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 72 0 0.0 0.0 100.0 100.0
main Main 158 0 0.0 0.0 0.0 0.0
CAF:main Main 143 0 0.0 0.0 84.2 97.1
main Main 144 1 0.0 0.0 84.2 97.1
ts3 Main 145 1 11.0 2.9 84.2 97.1
ts3.\ Main 156 5000000 54.1 91.4 73.2 94.3
ts3.\.element Main 157 5000000 19.0 2.9 19.0 2.9
CAF:kv Main 142 0 0.0 0.0 0.0 0.0
代码
-- ghc -O2 --make test8.hs -prof -auto-all -caf-all -fforce-recomp +RTS
-- ./test8 +RTS -p
{-# LANGUAGE BangPatterns #-}
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip (k) (v)
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts :: V.Vector (Int,Int) -> M.HashMap Int Int
ts vec =
V.foldl' (\ht (k, v) -> M.insertWith (+) k v ht) M.empty vec
ts2 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts2 vec =
V.foldl (\ht (!k, !v) -> M.insertWith (++) k [v] ht) M.empty vec
ts3 :: V.Vector (Int, Int) -> M.HashMap Int [Int]
ts3 vec =
V.foldl (\ht (!k, !v) ->
let
element = M.lookup k ht in
case element of
Nothing -> M.insert k [v] ht
Just aList -> M.insert k (v:aList) ht) M.empty vec
ts4 :: V.Vector (Int,Int) -> M.HashMap Int [Int]
ts4 vec =
let initMap = V.foldl (\ht (!k,_) -> M.insert k [] ht) M.empty vec
in
V.foldl (\ht (!k, !v) -> M.adjust(\x -> v:x) k ht) initMap vec
main :: IO ()
main = ts3 kv `seq` putStrLn "calling ts3 done."
main1 = do
if x == y then
putStrLn "Algos Match"
else
putStrLn "Error"
where
x = ts2 kv
y = ts4 kv
答案 2 :(得分:1)
你的大部分问题都归因于(++)
导致大量无用的复制中间列表"正如Carl在答案中所说的那样。在替换(++)
时使用了一些不同的方法,到目前为止,我从Data.IntMap.Strict
切换到containers
得到了最好的结果(只是为了利用较不严格的API - 我不会&# 39;知道哪个实现更高效本身)并使用其alter
函数预先添加向量元素而不创建单例列表:
import qualified Data.IntMap.Strict as M
import qualified Data.Vector.Unboxed as V
n :: Int
n = 5000000
kv :: V.Vector (Int,Int)
kv = V.zip k v
where
k = V.generate n (\i -> i `mod` 1000)
v = V.generate n (\i -> i)
ts2 :: V.Vector (Int,Int) -> M.IntMap [Int]
ts2 vec =
V.foldl' (\ht (k, v) -> M.alter (prep v) k ht) M.empty vec
where
prep x = Just . maybe [x] (x:)
main :: IO ()
main = print $ M.foldl' (+) 0 $ M.map length $ ts2 kv
第二个最佳解决方案是使用
\ht (k, v) -> M.insertWith (\(x:_) -> (x :)) k [v] ht
作为折叠运算符。这适用于Data.IntMap.Strict
和Data.HashMap.Strict
,结果表现相似。
N.B。:请注意,在所有情况下,包括原始实现,向量元素被预先添加到列表中,而不是附加到列表中。如果要附加元素,则问题会更加严重,因为重复附加到(++)
的空列表是元素数量的二次方。