attoparsec高内存使用率读取大文件

时间:2015-11-07 14:20:47

标签: performance parsing haskell memory attoparsec

我已经尝试了很多方法逐行解析文件内容,但目前无法正常工作,运行时会占用大量内存(超过16GB)。

这是我要解析的文件的一个子集http://lpaste.net/144719

我想要三种错误:

1)回溯错误(多行,第一行如3))​​
2)单行多一行的错误
3)单行错误

这是我目前的代码:

import qualified Data.ByteString as B
import Data.ByteString.Char8 as B8 hiding (lines, filter, unlines, head, readFile, take, length,
                                           putStrLn, tail, map, concat, or, writeFile, intersperse,
                                           groupBy, hGetContents)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Data.Attoparsec.Text hiding (take)
import Control.Applicative
import Control.Monad (replicateM, mapM)
import Data.Either (either)
import Data.List (intersperse, groupBy)
import System.Environment
import qualified System.IO as SIO

data TimeStamp = MkTimeStamp T.Text
               deriving Show

data LogFileInfo = BackTraceLineInfo T.Text
                 | BackTraceInfo TimeStamp T.Text T.Text [LogFileInfo]
                 | Error TimeStamp T.Text
                 | LargeError TimeStamp T.Text T.Text
                 deriving Show

data LineType = SingleLineError TimeStamp T.Text
              | DirectoryInfo T.Text
              | ErrorInfo T.Text
              | LineBackTraceInfo T.Text
              | BackTraceString T.Text
              | BackTraceLine T.Text
              deriving Show

parseTimeStamp :: Parser TimeStamp
parseTimeStamp = do
  year <- many digit
  char '-'
  month <- many digit
  char '-'
  day <- many digit
  char ' '
  hour <- many digit
  char ':'
  minute <- many digit
  char ':'
  second <- many digit
  char ' '
  (return . MkTimeStamp) $ T.pack $ year ++ "-" ++ month ++ "-" ++ day ++ " " ++ hour ++ ":" ++ minute ++ ":" ++ second

parseError :: Parser LineType
parseError = do
  string $ T.pack "ERROR - "
  timeStamp <- parseTimeStamp
  errorInfo <- parseAnyLine
  return $ SingleLineError timeStamp errorInfo

parseDirectoryInfo :: Parser LineType
parseDirectoryInfo = do
  char '/'
  directoryInfo <- parseAnyLine
  (return . DirectoryInfo) $ T.append (T.pack "/") directoryInfo

parseErrorInfo :: Parser LineType
parseErrorInfo = do
  errorInfo <- parseAnyLine
  (return . ErrorInfo) errorInfo

parseBackTraceString :: Parser LineType
parseBackTraceString = do
  let backTraceStr = T.pack " Backtrace: "
  string backTraceStr
  return $ BackTraceString backTraceStr

parseBacktraceLine :: Parser LineType
parseBacktraceLine = do
  char '#'
  number <- many1 digit
  backTraceInfo <- parseAnyLine
  let numberPart = T.pack $ '#' : number
  return $ LineBackTraceInfo $ T.append numberPart backTraceInfo

parseAnyLine :: Parser T.Text
parseAnyLine = fmap T.pack $ many anyChar

-- Skips n lines for allowing other parsers to succeed
skipNLines n = replicateM n $ manyTill anyChar endOfLine

-- performParser :: Parser a -> T.Text -> BackTraceInfo
performParser = parseOnly

getEitherRight :: Either a b -> b
getEitherRight (Right b) = b

parseLogFile :: [T.Text] -> [LineType]
parseLogFile textxs = 
  let listaEithers = mapM (parseOnly $
                           try parseError
                       <|> try parseDirectoryInfo
                       <|> try parseBacktraceLine
                       <|> try parseBackTraceString
                       <|> parseErrorInfo) textxs
  in getEitherRight listaEithers

customUnlines :: [String] -> String
customUnlines []     = []
customUnlines (x:xs) = if x == "\n"
                         then '\n':customUnlines xs
                         else x ++ "\n" ++ customUnlines xs

main = do
  (fileName : _) <- getArgs
  h <- SIO.openFile fileName SIO.ReadMode
  SIO.hSetEncoding h SIO.latin1
  fileContents <- SIO.hGetContents h
  let titleLength           = length fileName
      titleWithoutExtension = take (titleLength - 4) fileName
      allNonEmptyLines      = map T.pack $ intersperse "\n" $ tail $ filter (/= "") $ lines fileContents -- [T.Text]
      listParseResults      = parseLogFile allNonEmptyLines -- [LineType]
      -- onlyModelErrors       = filter isModelError parseResult -- [LogFileInfo]
      -- onlyOneRepresentative = map head $ groupBy equalErrors onlyModelErrors
      listOfStrings         = map show listParseResults
  writeFile (titleWithoutExtension ++ ".logsummary") $ customUnlines listOfStrings

第一个问题是解析器没有解析任何东西。第二个问题是使用16GB的RAM。如何改进我的方法?

1 个答案:

答案 0 :(得分:2)

至少有两个问题 - writeFilecustomUnlines

writeFile需要在编写之前收集所有输出,所以我首先看看它是否会产生输出:

h <- openFile "summary.txt" WriteMode
forM_ listOfStrings (hPutStrLn h)
hClose h

如果listOfStrings是一个惰性列表,这应该以流方式处理日志文件。

假设这有效,为了实现你的customUnlines逻辑,我会这样做:

h <- openFile "summary.txt" WriteMode
forM_ listOfStrings $ \x -> do
  if x == "\n"
    then hPutStr h "\n"
    else hPutStrLn h "\n"
hClose h

如果listOfStrings不是懒惰列表,那么我需要你的导入来进一步调试问题。

<强>更新

由于listOfStringsparseLogFile不是懒惰列表。

请注意,listaEithers的类型为Either String [LineType]。这意味着您必须在返回之前解析所有行。相反,您应该分别解析每一行:

forM_ allNonEmptyLines $ \x -> do
  case parseOnly parseLogLine x of
    Left e -> error "oops"
    Right a -> print a       -- a is a LineType

此处parseLogLine是:

parseLogLine =
  try parseError
  <|> try parseDirectoryInfo
  <|> try parseBacktraceLine
  <|> try parseBackTraceString
  <|> parseErrorInfo