如何在Haskell中将大数据块解析为内存?

时间:2014-03-20 06:37:27

标签: performance haskell space-leak

经过反思,这整个问题可归结为更简洁的问题。我正在寻找一个

的Haskell数据结构
  • 看起来像一个列表
  • 有O(1)查找
  • 有O(1)元素替换 O(1)元素追加(或前置...如果是这种情况,我可以反转我的索引查找)。我总是可以用一个或另一个来编写我以后的算法。
  • 内存开销很小

我正在尝试构建一个图像文件解析器。文件格式是基本的8位彩色ppm文件,但我打算支持16位彩色文件和PNG和JPEG文件。现有的Netpbm库,尽管有很多拆箱注释,但在尝试加载我使用的文件时,实际上会消耗所有可用内存:

3-10张照片,最小的是45MB,最大的是110MB。

现在,我无法理解Netpbm代码中的优化,所以我决定自己尝试一下。它是一种简单的文件格式......

我已经开始决定无论文件格式是什么,我都会以这种格式存储未压缩的最终图像:

import Data.Vector.Unboxed (Vector)
data PixelMap = RGB8 {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    }

然后我编写了一个解析器,它可以处理三个向量,如下所示:

import Data.Attoparsec.ByteString
data Progress = Progress {
      addr      :: Int
    , size      :: Int
    , redC      :: Vector Word8
    , greenC    :: Vector Word8
    , blueC     :: Vector Word8
    }

parseColorBinary :: Progress -> Parser Progress
parseColorBinary progress@Progress{..}
    | addr == size = return progress
    | addr < size = do
        !redV <- anyWord8
        !greenV <- anyWord8
        !blueV <- anyWord8
        parseColorBinary progress { addr    = addr + 1
                                  , redC    = redC V.// [(addr, redV)]
                                  , greenC  = greenC V.// [(addr, greenV)]
                                  , blueC   = blueC V.// [(addr, blueV)] }

在解析器的最后,我像这样构造RGB8:

Progress{..} <- parseColorBinary $ ...
return $ RGB8 width height redC greenC blueC

这样编写,加载这些45MB图像中的一个的程序将消耗5GB或更多的内存。如果我更改Progress的定义,以便redCgreenCblueC都是!(Vector Word8),那么程序仍然在合理的内存范围内,但需要这么长时间加载一个我还没有让它真正完成的文件。最后,如果我用标准列表替换这里的向量,我的内存使用量会回升到每个文件5GB(我假设......实际上我用完了空间之前用完了空间),加载时间是大约6秒。 Ubuntu的预览应用程序一旦启动,几乎可以立即加载和呈现文件。

根据理论,每次调用V.//实际上是每次都完全复制向量,我尝试切换到Data.Vector.Unboxed.Mutable,但是......我甚至无法得到它来进行类型检查。文档是不存在的,并且了解数据类型正在做什么也需要与多个其他库进行斗争。而且我甚至不知道它是否会解决问题,所以我甚至都不愿意尝试。

基本问题实际上非常简单:

如何快速,无需使用大量内存,读取,保留并可能操纵非常大的数据结构?我发现的所有例子都是关于暂时生成大量数据,然后尽快摆脱它。

原则上,我希望最终的表示是不可改变的,但如果我必须使用可变结构来实现目标,我就不会太在意。


为了完整起见,完整的代码(BSD3许可)位于https://bitbucket.org/savannidgerinel/photo-tools的bitbucket上。 performance分支包含严格版本的解析器,可以通过快速更改Progress Codec.Image.Netpbm的{​​{1}}数据结构来使其变得不严格。

运行性能测试

ulimit -Sv 6000000 -- set a ulimit of 6GB, or change to whatever makes sense for you
cabal build
dist/build/perf-test/perf-test +RTS -p -sstderr

2 个答案:

答案 0 :(得分:4)

我首先想到的只是简单地读取整个bytestring的块然后将内容解压缩到未装箱的向量就足够了。实际上,即使没有神秘的空间泄漏,您发布的解析代码也会相当糟糕:您在输入的每个字节上复制所有三个向量的全部内容!谈论二次复杂性。

所以我写了以下内容:

chunksOf3 :: [a] -> [(a, a, a)]
chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
chunksOf3 _          = []

parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
parseRGB size = do
    input <- Atto.take (size * 3)
    let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
    return (V.fromList rs, V.fromList gs, V.fromList bs)

然后使用45 Mb随机字节文件对其进行测试。我承认我很惊讶这段代码导致了数十亿字节的RAM使用率。我很好奇空间泄漏的确切位置。

可变载体虽然效果很好。以下代码使用133 Mb RAM和Criterion基准测试,包括60 ms文件读取。我在评论中加入了一些解释。在ST monad和SO和其他地方的可变载体上也有充足的材料(我同意图书馆文件对初学者不友好)。

import Data.Vector.Unboxed (Vector)
import Data.ByteString (ByteString)

import qualified Data.Vector.Unboxed as V
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed.Mutable as MV

import Control.Monad.ST.Strict 
import Data.Word
import Control.Monad
import Control.DeepSeq

-- benchmarking stuff
import Criterion.Main (defaultMainWith, bench, whnfIO)
import Criterion.Config (defaultConfig, Config(..), ljust)

-- This is just the part that parses the three vectors for the colors.
-- Of course, you can embed this into an Attoparsec computation by taking 
-- the current input, feeding it to parseRGB, or you can just take the right 
-- sized chunk in the parser and omit the "Maybe" test from the code below. 
parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
parseRGB size input 
    | 3* size > B.length input = Nothing
    | otherwise = Just $ runST $ do

        -- We are allocating three mutable vectors of size "size"
        -- This is usually a bit of pain for new users, because we have to
        -- specify the correct type somewhere, and it's not an exactly simple type.
        -- In the ST monad there is always an "s" type parameter that labels the
        -- state of the action. A type of "ST s something" is a bit similar to
        -- "IO something", except that the inner type often also contains "s" as
        -- parameter. The purpose of that "s" is to statically disallow mutable
        -- variables from escaping the ST action. 
        [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

        -- forM_ = flip mapM_
        -- In ST code forM_ is a nicer looking approximation of the usual
        -- imperative loop. 
        forM_ [0..size - 1] $ \i -> do
            let i' = 3 * i
            MV.unsafeWrite r i (B.index input $ i'    )
            MV.unsafeWrite g i (B.index input $ i' + 1)
            MV.unsafeWrite b i (B.index input $ i' + 2)

        -- freeze converts a mutable vector living in the ST monad into 
        -- a regular vector, which can be then returned from the action
        -- since its type no longer depends on that pesky "s".
        -- unsafeFreeze does the conversion in place without copying.
        -- This implies that the original mutable vector should not be used after
        -- unsafeFreezing. 
        [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
        return (r, g, b)

-- I prepared a file with 3 * 15 million random bytes.
inputSize = 15000000
benchConf = defaultConfig {cfgSamples = ljust 10}

main = do
    defaultMainWith benchConf (return ()) $ [
        bench "parseRGB test" $ whnfIO $ do 
            input <- B.readFile "randomInp.dat" 
            force (parseRGB inputSize input) `seq` putStrLn "done"
        ]

答案 1 :(得分:3)

这是一个直接从磁盘解析文件而不将任何中间件加载到内存中的版本:

import Control.Applicative
import Control.Monad (void)
import Data.Attoparsec.ByteString (anyWord8)
import Data.Attoparsec.ByteString.Char8 (decimal)
import qualified Data.Attoparsec.ByteString as Attoparsec
import Data.ByteString (ByteString)
import Data.Vector.Unboxed (Vector)
import Data.Word (Word8)
import Control.Foldl (FoldM(..), impurely, vector, premapM) -- Uses `foldl-1.0.3`
import qualified Pipes.ByteString
import Pipes.Parse
import Pipes.Attoparsec (parse, parsed)
import qualified System.IO as IO

data PixelMap = PixelMap {
      width :: Int
    , height :: Int
    , redChannel :: Vector Word8
    , greenChannel :: Vector Word8
    , blueChannel :: Vector Word8
    } deriving (Show)

-- Fold three vectors simultaneously, ensuring strictness and efficiency
rgbVectors
    :: FoldM IO (Word8, Word8, Word8) (Vector Word8, Vector Word8, Vector Word8)
rgbVectors =
    (,,) <$> premapM _1 vector <*> premapM _2 vector <*> premapM _3 vector
  where
    _1 (a, b, c) = a
    _2 (a, b, c) = b
    _3 (a, b, c) = c

triples
    :: Monad m
    => Producer ByteString m r
    -> Producer (Word8, Word8, Word8) m ()
triples p = void $ parsed ((,,) <$> anyWord8 <*> anyWord8 <*> anyWord8) p

-- I will probably ask Renzo to simplify the error handling for `parse`
-- This is a helper function to just return `Nothing`
parse'
    :: Monad m
    => Attoparsec.Parser r -> Parser ByteString m (Maybe r)
parse' parser = do
    x <- parse parser
    return $ case x of
        Just (Right r) -> Just r
        _              -> Nothing

parsePixelMap :: Producer ByteString IO r -> IO (Maybe PixelMap)
parsePixelMap p = do
    let parseWH = do
            mw <- parse' decimal
            mh <- parse' decimal
            return ((,) <$> mw <*> mh)
    (x, p') <- runStateT parseWH p
    case x of
        Nothing     -> return Nothing
        Just (w, h) -> do
            let size = w * h
                parser = impurely foldAllM rgbVectors
                source = triples (p' >-> Pipes.ByteString.take size)
            (rs, gs, bs) <- evalStateT parser source
            return $ Just (PixelMap w h rs gs bs)

main = IO.withFile "image.ppm" IO.ReadMode $ \handle -> do
    pixelMap <- parsePixelMap (Pipes.ByteString.fromHandle handle)
    print pixelMap

我在没有50 MB文件的标题逻辑的情况下对其进行了测试,并且它运行的空间大致相同。