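I have a set of binary records packed into a file, and I am reading them with Data.ByteString.Lazy and Data.Binary.Get. With my current implementation, an 8 MB file takes 6 seconds to parse.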
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int } deriving (Show)

getTrades = do
  empty <- isEmpty
  if empty
    then return []
    else do
      timestamp <- getWord32le
      price <- getWord32le
      qty <- getWord16le
      rest <- getTrades
      let trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
      return (trade : rest)

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  let trades = runGet getTrades input
  print $ length trades
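What can I do to speed this up?

Answer 0 (score: 20)

Refactoring it slightly (basically into a left fold) gives much better performance and lowers the GC overhead quite a bit when parsing an 8388600-byte file.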
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

-- Decode a single 10-byte record.
getTrade :: Get Trade
getTrade = do
  timestamp <- getWord32le
  price <- getWord32le
  qty <- getWord16le
  return $! Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)

-- Left fold over the input: decode one record, bump the counter, continue with the rest.
countTrades :: BL.ByteString -> Int
countTrades input = stepper (0, input) where
  stepper (!count, !buffer)
    | BL.null buffer = count
    | otherwise =
        let (trade, rest, _) = runGetState getTrade buffer 0
        in stepper (count+1, rest)

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  let trades = countTrades input
  print trades
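
As far as I can tell, newer releases of the binary package mark runGetState as deprecated. The sketch below shows the same left fold written with runGetOrFail instead, assuming a binary version that exports it. The Either String result and the wording of the error message are illustrative choices, not part of the original code, and this variant is not covered by the measurements that follow.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade
  { timestamp :: {-# UNPACK #-} !Int
  , price     :: {-# UNPACK #-} !Int
  , qty       :: {-# UNPACK #-} !Int
  } deriving (Show)

-- Decode a single 10-byte record (4 + 4 + 2 bytes, little-endian).
getTrade :: Get Trade
getTrade = do
  t <- getWord32le
  p <- getWord32le
  q <- getWord16le
  return $! Trade (fromIntegral t) (fromIntegral p) (fromIntegral q)

-- Same left fold as above, but a truncated trailing record is reported
-- as a Left value instead of raising an exception.
countTrades :: BL.ByteString -> Either String Int
countTrades = go 0
  where
    go !count buffer
      | BL.null buffer = Right count
      | otherwise =
          case runGetOrFail getTrade buffer of
            Left (_, _, err)        -> Left ("record " ++ show count ++ ": " ++ err)
            Right (rest, _, _trade) -> go (count + 1) rest

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  either putStrLn print (countTrades input)
```

The shape of the loop is unchanged: each step consumes one record and passes the remaining input along, so nothing has to hold on to the already-parsed prefix.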
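Here are the relevant runtime statistics. Even though the allocation figures are close, GC activity and maximum heap size differ considerably between the revisions.

All examples here were built with GHC 7.4.1 and -O2.

The original source, run with +RTS -K1G -RTS because of its excessive stack usage: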
    426,003,680 bytes allocated in the heap
    443,141,672 bytes copied during GC
     99,305,920 bytes maximum residency (9 sample(s))
            203 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.62s  (  0.81s elapsed)
    %GC time      83.3%  (86.4% elapsed)
Daniel's revision:
    357,851,536 bytes allocated in the heap
    220,009,088 bytes copied during GC
     40,846,168 bytes maximum residency (8 sample(s))
             85 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.24s  (  0.28s elapsed)
    %GC time      69.1%  (71.4% elapsed)
This post:
    290,725,952 bytes allocated in the heap
        109,592 bytes copied during GC
         78,704 bytes maximum residency (10 sample(s))
              2 MB total memory in use (0 MB lost due to fragmentation)

    Total time    0.06s  (  0.07s elapsed)
    %GC time      5.0%  (6.0% elapsed)
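
To try these measurements yourself, you need an input file of the same size. The following is a minimal sketch of a generator using Data.Binary.Put. The record layout (4 + 4 + 2 bytes, little-endian) is taken from the parser, but the field values are arbitrary placeholders and the helper names putTrade and sampleTrades are made up for this example.

```haskell
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Put (Put, putWord16le, putWord32le, runPut)
import Data.Word (Word32)

-- One 10-byte record in the layout the parser expects:
-- timestamp (4 bytes), price (4 bytes), qty (2 bytes), all little-endian.
putTrade :: Word32 -> Put
putTrade i = do
  putWord32le i                            -- timestamp (arbitrary value)
  putWord32le (1000 + i `mod` 100)         -- price (arbitrary value)
  putWord16le (fromIntegral (i `mod` 500)) -- qty (arbitrary value)

-- n records serialised back to back.
sampleTrades :: Word32 -> BL.ByteString
sampleTrades n = runPut (mapM_ putTrade [1 .. n])

main :: IO ()
main =
  -- 838860 records * 10 bytes per record = 8388600 bytes
  BL.writeFile "trades.bin" (sampleTrades 838860)
```

Running a compiled parser with +RTS -s -RTS then prints statistics in the format shown above.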
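
Answer 1 (score: 17)

Your code decodes an 8 MB file in under a second here (ghc-7.4.1), though of course I compiled with -O2.

However, it needs an excessive amount of stack space. You can reduce what it needs by adding more strictness in the appropriate places and by using an accumulator to collect the trades parsed so far.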
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

data Trade = Trade { timestamp :: {-# UNPACK #-} !Int
                   , price     :: {-# UNPACK #-} !Int
                   , qty       :: {-# UNPACK #-} !Int
                   } deriving (Show)

getTrades :: Get [Trade]
getTrades = go []
  where
    -- acc holds the trades parsed so far, in reverse order.
    go !acc = do
      empty <- isEmpty
      if empty
        then return $! reverse acc
        else do
          !timestamp <- getWord32le
          !price <- getWord32le
          !qty <- getWord16le
          let !trade = Trade (fromIntegral timestamp) (fromIntegral price) (fromIntegral qty)
          go (trade : acc)

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  let trades = runGet getTrades input
  print $ length trades
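The strictness and unpacking make sure that no work is left behind that could come back to bite you later by referencing a part of the ByteString that should already have been forgotten.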
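If you need Trade to have lazy fields, you can still decode via a type with strict fields and map a conversion over the resulting list, so that you still benefit from the stricter decoding.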
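
A minimal sketch of that idea follows. StrictTrade, toTrade, getStrictTrades and decodeTrades are hypothetical names introduced for this example; Trade is the lazy-field type from the question, with Eq derived only to make results easy to compare.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get

-- The lazy-field type the rest of the program works with.
data Trade = Trade { timestamp :: Int, price :: Int, qty :: Int }
  deriving (Show, Eq)

-- Hypothetical strict twin, used only while decoding.
data StrictTrade = StrictTrade {-# UNPACK #-} !Int {-# UNPACK #-} !Int {-# UNPACK #-} !Int

-- Conversion mapped over the decoded list.
toTrade :: StrictTrade -> Trade
toTrade (StrictTrade t p q) = Trade t p q

-- Strict decoding with an accumulator, as in the code above.
getStrictTrades :: Get [StrictTrade]
getStrictTrades = go []
  where
    go !acc = do
      done <- isEmpty
      if done
        then return $! reverse acc
        else do
          !t <- getWord32le
          !p <- getWord32le
          !q <- getWord16le
          let !trade = StrictTrade (fromIntegral t) (fromIntegral p) (fromIntegral q)
          go (trade : acc)

-- Decode strictly, then convert to the lazy-field type.
decodeTrades :: BL.ByteString -> [Trade]
decodeTrades = map toTrade . runGet getStrictTrades

main :: IO ()
main = do
  input <- BL.readFile "trades.bin"
  print $ length (decodeTrades input)
```

The conversion itself is lazy, but every StrictTrade it reads from is already fully evaluated, so no thunk in the final list refers back to the input ByteString.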
However, the code still spends a lot of time in garbage collection, so there is probably still room for further improvement.