Question

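I should start by saying that I'm a beginner with Haskell and the pipes library, and I'd like to understand what is causing the high memory usage of this program in the test function.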

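Specifically, in the fold that produces the r1 value in test, I see MyRecord values accumulating until the final result is produced, unless deepseq is used. On my sample dataset of ~500,000 lines / ~230 MB, memory usage grows to over 1.5 GB.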

The fold that produces the r2 value runs in constant memory.

What I'd like to understand is:

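1) What could be causing the build-up of MyRecord values in the first fold, and why does using deepseq fix it? I was throwing things at it more or less at random until deepseq gave me constant memory usage, but I'd like to understand why it works. Can constant memory usage be achieved without deepseq while still producing the same result type, Maybe Int?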

2) What is different about the second fold that keeps it from having the same problem?

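I know that if I were working with plain integers rather than tuples I could use the built-in sum function from Pipes.Prelude, but eventually I will want to process the second element, which contains any parsing errors.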

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}

module Test where

import           Control.Arrow
import           Control.DeepSeq
import           Control.Monad
import           Data.Aeson
import           Data.Function
import           Data.Maybe
import           Data.Monoid
import           Data.Text (Text)

import           Pipes
import qualified Pipes.Aeson as PA (DecodingError(..))
import qualified Pipes.Aeson.Unchecked as PA
import qualified Pipes.ByteString as PB
import qualified Pipes.Group as PG
import qualified Pipes.Parse as PP
import qualified Pipes.Prelude as P

import           System.IO
import           Control.Lens
import qualified Control.Foldl as Fold

data MyRecord = MyRecord
  { myRecordField1 :: !Text
  , myRecordField2 :: !Int
  , myRecordField3 :: !Text
  , myRecordField4 :: !Text
  , myRecordField5 :: !Text
  , myRecordField6 :: !Text
  , myRecordField7 :: !Text
  , myRecordField8 :: !Text
  , myRecordField9 :: !Text
  , myRecordField10 :: !Int
  , myRecordField11 :: !Text
  , myRecordField12 :: !Text
  , myRecordField13 :: !Text
  } deriving (Eq, Show)

instance FromJSON MyRecord where
  parseJSON (Object o) =
    MyRecord <$> o .: "field1" <*> o .: "field2" <*> o .: "field3" <*>
    o .: "field4" <*>
    o .: "field5" <*>
    o .: "field6" <*>
    o .: "field7" <*>
    o .: "field8" <*>
    o .: "field9" <*>
    (read <$> o .: "field10") <*>
    o .: "field11" <*>
    o .: "field12" <*>
    o .: "field13"
  parseJSON x = fail $ "MyRecord: expected Object, got: " <> show x

instance ToJSON MyRecord where
    toJSON _ = undefined

test :: IO ()
test = do
  withFile "some-file" ReadMode $ \hIn
  {-

      the pipeline is composed as follows:

      1 a producer reading a file with Pipes.ByteString, splitting chunks into lines,
        and parsing the lines as JSON to produce tuples of (Maybe MyRecord, Maybe
        ByteString), the second element being an error if parsing failed

      2 a pipe filtering that tuple on a field of Maybe MyRecord, passing matching
        (Maybe MyRecord, Maybe ByteString) downstream

      3 and a pipe that picks an Int field out of Maybe MyRecord, passing (Maybe Int,
        Maybe ByteString) downstream

      pipeline == 1 >-> 2 >-> 3

      memory profiling indicates the memory build up is due to accumulation of
      MyRecord "objects", and data types comprising their fields (mainly
      Text/ARR_WORDS)

  -}
   -> do
    let pipeline = f1 hIn >-> f2 >-> f3
    -- need to use deepseq to avoid leaking memory
    r1 <-
      P.fold
        (\acc (v, _) -> (+) <$> (acc `deepseq` acc) <*> pure (fromMaybe 0 v))
        (Just 0)
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r1
    hSeek hIn AbsoluteSeek 0
    -- this works just fine as is and streams in constant memory
    r2 <-
      P.fold
        (\acc v ->
           case fst v of
             Just x -> acc + x
             Nothing -> acc)
        0
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r2
    return ()
  return ()

f1
  :: (FromJSON a, MonadIO m)
  => Handle -> Producer (Maybe a, Maybe PB.ByteString) m ()
f1 hIn = PB.fromHandle hIn & asLines & resumingParser PA.decode

f2
  :: Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) IO r
f2 = filterRecords (("some value" ==) . myRecordField5)

f3 :: Pipe (Maybe MyRecord, d) (Maybe Int, d) IO r
f3 = P.map (first (fmap myRecordField10))

filterRecords
  :: Monad m
  => (MyRecord -> Bool)
  -> Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) m r
filterRecords predicate =
  for cat $ \(l, e) ->
    when (isNothing l || (predicate <$> l) == Just True) $ yield (l, e)

asLines
  :: Monad m
  => Producer PB.ByteString m x -> Producer PB.ByteString m x
asLines p = Fold.purely PG.folds Fold.mconcat (view PB.lines p)

parseRecords
  :: (Monad m, FromJSON a, ToJSON a)
  => Producer PB.ByteString m r
  -> Producer a m (Either (PA.DecodingError, Producer PB.ByteString m r) r)
parseRecords = view PA.decoded

resumingParser
  :: Monad m
  => PP.StateT (Producer a m r) m (Maybe (Either e b))
  -> Producer a m r
  -> Producer (Maybe b, Maybe a) m ()
resumingParser parser p = do
  (x, p') <- lift $ PP.runStateT parser p
  case x of
    Nothing -> return ()
    Just (Left _) -> do
      (x', p'') <- lift $ PP.runStateT PP.draw p'
      yield (Nothing, x')
      resumingParser parser p''
    Just (Right b) -> do
      yield (Just b, Nothing)
      resumingParser parser p'

Answer 1

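As mentioned in the docs for Pipes.foldl, the fold is strict. However, the strictness is implemented with $!, which only forces evaluation to WHNF (weak head normal form). WHNF is enough to fully evaluate a simple type like an Int, but it is not enough to fully evaluate a more complex type like a Maybe Int.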

Some examples:

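main1 = do
  let a = 3 + undefined
      b = seq a 10
  print b  -- error: Exception: Prelude.undefined

main2 = do
  let a = Just (3 + undefined)
      b = seq a 10
  print b  -- no exception

In the first fold (r1), the variable acc is a Just wrapping a big thunk: the sum of all the elements. On each iteration acc goes from Just a to Just (a+b) to Just (a+b+c), and so on. The addition is not performed during the fold; it is only done at the very end. The high memory usage comes from holding this ever-growing sum in memory. Each pending addend presumably still references the record its Int field has yet to be read from, which would explain why the heap profile shows MyRecord values piling up.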

In the second fold (r2), the sum is reduced by $! to a plain Int on every iteration.
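On the question of whether the Maybe Int result type can be kept without deepseq: a minimal sketch, assuming the fold forces its accumulator only to WHNF at each step as described above. It uses foldl' from Data.List as a stand-in for P.fold so that it runs without the pipes package; stepStrict and the sample input are hypothetical names, not part of the original program.

```haskell
import Data.List (foldl')
import Data.Maybe (fromMaybe)

-- Hypothetical step function with the same shape as the one passed to
-- P.fold for r1. Matching on acc and building the result with ($!) forces
-- the Int inside the Just, so evaluating the result to WHNF leaves no
-- chain of pending additions behind.
stepStrict :: Maybe Int -> (Maybe Int, b) -> Maybe Int
stepStrict acc (v, _) =
  case acc of
    Nothing -> Nothing
    Just n  -> Just $! n + fromMaybe 0 v

main :: IO ()
main = do
  -- foldl' forces the accumulator to WHNF at each step (stand-in for P.fold).
  let input = [(Just i, ()) | i <- [1 .. 1000000 :: Int]]
  print (foldl' stepStrict (Just 0) input)
```

The same step function should be usable as the first argument of P.fold in place of the deepseq version, since it has the same type.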

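Besides using deepseq directly, you can also use force, which is defined as:

force x = x `deepseq` x

As the deepseq documentation notes, force can be combined with ViewPatterns to create a pattern that fully evaluates a function argument.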
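A minimal sketch of the ViewPatterns approach applied to a step function shaped like the one in the r1 fold. The name stepForced and the sample input are hypothetical, and foldl' again stands in for P.fold so the example runs with only the deepseq package.

```haskell
{-# LANGUAGE ViewPatterns #-}

import Control.DeepSeq (force)
import Data.List (foldl')
import Data.Maybe (fromMaybe)

-- Hypothetical step function. The view pattern (force -> acc) binds acc to
-- `force` applied to the incoming accumulator. Evaluating the result to
-- WHNF inspects acc, which fully evaluates the previous accumulator, so at
-- most one pending addition exists at any time.
stepForced :: Maybe Int -> (Maybe Int, b) -> Maybe Int
stepForced (force -> acc) (v, _) = (+) <$> acc <*> pure (fromMaybe 0 v)

main :: IO ()
main = do
  let input = [(Just i, ()) | i <- [1 .. 1000000 :: Int]]
  print (foldl' stepForced (Just 0) input)
```

Note that the deep evaluation only happens when acc is actually demanded; here the Maybe applicative inspects acc on every step, so that is always the case.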

Understanding the memory usage of this Haskell program
