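Performing operations on parsed data with attoparsec

Date: 2015-09-08 21:41:25

Tags: parsing haskell attoparsec

Background

I've written a log file parser using attoparsec. All of my smaller parsers succeed, as does the composed final parser. I've confirmed this with tests. But I'm stumbling over performing operations on the parsed stream.

What I've tried

I started by trying to pass the successfully parsed input to a function. But all I seem to get is Done (), which I presume means the log file has already been consumed by that point.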

prepareStats :: Result Log -> IO ()
prepareStats r =
  case r of
    Fail _ _ _ -> putStrLn $ "Parsing failed"
    Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.

main :: IO ()
main = do
    [f] <- getArgs
    logFile <- B.readFile (f :: FilePath)
    let results = parseOnly parseLog logFile
    putStrLn "TBC"

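What I'm trying to do

I want to accumulate some statistics from the log file as I consume the input. For example, I'm parsing response codes and I'd like to count how many 2** responses there were and how many 4/5** responses. I'm parsing the number of bytes each response returned as Ints, and I'd like to sum them efficiently (sounds like a foldl'?). I've defined a data type like this: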

data Stats = Stats {
    successfulRequestsPerMinute :: Int
  , failingRequestsPerMinute    :: Int
  , meanResponseTime            :: Int
  , megabytesPerMinute          :: Int
  } deriving Show

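which I'd like to update continually as I parse the input. But performing operations while I consume the input is where I got stuck. So far print is the only function I've successfully passed output to, and it showed that parsing succeeds by returning Done before printing the output.

My main parsers look like this: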

parseLogEntry :: Parser LogEntry
parseLogEntry = do
    ip <- logItem
    _ <- char ' '
    logName <- logItem
    _ <- char ' '
    user <- logItem
    _ <- char ' '
    time <- datetimeLogItem
    _ <- char ' '
    firstLogLine <- quotedLogItem
    _ <- char ' '
    finalRequestStatus <- intLogItem
    _ <- char ' '
    responseSizeB <- intLogItem
    _ <- char ' '
    timeToResponse <- intLogItem
    return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse

type Log = [LogEntry]

parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine

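Desired outcome

I want to pass each parsed line to a function that updates the data type above. Ideally this should be very memory-efficient, because it will be run on large files.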

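3 Answers:

Answer 0 (score: 2)

You have to make your unit of parsing a single log entry rather than a list of log entries.

It's not pretty, but here is an example of how to interleave parsing and processing:

(Depends on bytestring, attoparsec, mtl)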

{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Control.Monad.State.Strict
import Data.Maybe (fromMaybe) -- used by loop2 in the update further down

aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum

getNext :: MonadState [a] m => m (Maybe a)
getNext = do
  xs <- get
  case xs of
    [] -> return Nothing
    (y:ys) -> put ys >> return (Just y)

loop iresult =
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword; loop (parse aWord x')
    Partial _     -> do
      mx <- getNext
      case mx of
        Just y  -> loop (feed iresult y)
        Nothing -> case feed iresult BS.empty of
                     Fail _ _ msg  -> error $ "parse failed: " ++ msg
                     Done x' aword -> do lift $ process aword; return ()
                     Partial _     -> error $ "partial returned"  -- probably can't happen

process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w

theWords = map BS.pack [ "this is a te", "st of the emergency ", "broadcasting sys", "tem"]


main = runStateT (loop (Partial (parse aWord))) theWords

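Notes:

  • We parse one aWord at a time and call process after each word is recognized.
  • Use feed to give the parser more input when it returns a Partial.
  • Feed the parser an empty string when there is no more input left.
  • When Done is returned, process the recognized word and continue with parse aWord.
  • getNext is just an example of a monadic function that gets the next unit of input. Replace it with your own version, e.g. something that reads the next line from a file.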
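
The notes above can also be sketched without the State monad. The following is a minimal, pure variant under stated assumptions: the names item and foldParsed are hypothetical and not part of the answer, the input arrives as a list of chunks, and each recognized word is folded into an accumulator as soon as it is parsed instead of being printed.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8
  (Parser, IResult(..), parse, skipSpace, takeWhile1, endOfInput)
import Data.Char (isAlphaNum)
import Control.Applicative ((<|>))

-- Hypothetical item parser: the next word, or Nothing once only
-- whitespace remains before the end of input.
item :: Parser (Maybe BS.ByteString)
item = skipSpace *> ((Nothing <$ endOfInput) <|> (Just <$> takeWhile1 isAlphaNum))

-- Fold over the words found in a chunked input, one word at a time.
foldParsed :: (s -> BS.ByteString -> s) -> s -> [BS.ByteString] -> Either String s
foldParsed step s0 chunks = go (parse item BS.empty) s0 (filter (not . BS.null) chunks)
  where
    -- An empty chunk means "end of input" to attoparsec, so empty chunks
    -- are filtered out above and fed explicitly only when the list runs dry.
    go (Fail _ _ msg) _ _       = Left msg
    go (Partial k) s (c:cs)     = go (k c) s cs            -- supply more input
    go (Partial k) s []         = go (k BS.empty) s []     -- signal end of input
    go (Done _ Nothing) s _     = Right s                  -- nothing left to parse
    go (Done rest (Just w)) s cs =
      let s' = step s w
      in s' `seq` go (parse item rest) s' cs               -- restart on the leftover

main :: IO ()
main = print (foldParsed (\n _ -> n + 1) (0 :: Int) theWords)
  where
    theWords = map BS.pack ["this is a te", "st of the emergency ", "broadcasting sys", "tem"]
```

Because item uses takeWhile1 rather than takeWhile, input that is neither whitespace nor alphanumeric produces a Left instead of an endless run of empty words. That is a design choice of this sketch, not something the answer prescribes.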

Update

Here is a solution using parseWith, as @dfeuer suggested:

noMoreInput = fmap null get

loop2 x = do
  iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword;
                        if BS.null x'
                           then do b <- noMoreInput
                                   if b then return ()
                                        else loop2 x'
                           else loop2 x'
    Partial _     -> error $ "huh???" -- this really can't happen

main2 = runStateT (loop2 BS.empty) theWords

Answer 1 (score: 1)

Here is a simpler solution if each log entry is exactly one line:

do loglines <- fmap BS.lines $ BS.readFile "input-file.log"
   return $! foldl' go initialStats loglines  -- foldl' comes from Data.List
   where
     go stats logline = 
        case parseOnly yourParser logline of
          Left e  -> error $ "oops: " ++ e
          Right r -> let stats' = ... combine r with stats ...
                     in stats'

Basically you are just reading the file line by line, calling parseOnly on each line, and accumulating the results.
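
To make the accumulation step concrete, here is a small self-contained sketch of the same idea. Everything specific in it is an assumption made for illustration: the Entry and Counts types, the entry parser, and the two-field line format "<status> <bytes>" stand in for the real LogEntry, Stats, and log format. Unlike the snippet above, it skips lines that fail to parse rather than calling error.

```haskell
import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8 (Parser, parseOnly, decimal, char)
import Data.List (foldl')

-- Hypothetical cut-down entry: a status code and a response size.
data Entry = Entry !Int !Int

-- Hypothetical accumulator; the strict fields stop thunks from piling up.
data Counts = Counts
  { ok2xx      :: !Int  -- number of 2xx responses
  , failed45xx :: !Int  -- number of 4xx and 5xx responses
  , totalBytes :: !Int  -- sum of all response sizes
  } deriving (Show, Eq)

-- Assumed line format for this sketch: "<status> <bytes>", e.g. "200 512".
entry :: Parser Entry
entry = Entry <$> decimal <* char ' ' <*> decimal

-- Fold one raw line into the running counts.
step :: Counts -> BS.ByteString -> Counts
step c line =
  case parseOnly entry line of
    Left _ -> c  -- ignore lines that do not parse
    Right (Entry s b) ->
      let c' = c { totalBytes = totalBytes c + b }
      in if s >= 200 && s < 300 then c' { ok2xx = ok2xx c' + 1 }
         else if s >= 400 && s < 600 then c' { failed45xx = failed45xx c' + 1 }
         else c'

summarize :: [BS.ByteString] -> Counts
summarize = foldl' step (Counts 0 0 0)

main :: IO ()
main = print (summarize (BS.lines (BS.pack "200 10\n404 5\n500 7\nbad line\n201 3")))
```

For the sample input this counts two 2xx responses, two 4xx/5xx responses, and 25 bytes in total. Reading the file with the lazy ByteString variant instead would presumably be needed to keep memory flat on very large files.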

Answer 2 (score: 1)

This is properly done with a streaming library:

main = do
  f:_ <- getArgs
  withFile f ReadMode $ \h -> do
       result <- foldStream $ streamProcess $ streamHandle h
       print result
 where
  streamHandle  = undefined
  streamProcess = undefined
  foldStream    = undefined

Any streaming library can fill in the blanks, e.g.

 import qualified Pipes.Prelude as P
 import Pipes
 import qualified Pipes.ByteString as PB
 import Pipes.Group (folds)
 import qualified Control.Foldl as L
 import Control.Lens (view) -- or import Lens.Simple (view), or whatever

 streamHandle = PB.fromHandle :: Handle -> Producer ByteString IO ()

In that case we might then divide the labor further:

 streamProcess :: Monad m => Producer ByteString m r -> Producer LogEntry m r
 streamProcess p =  streamLines p >-> lineParser

 -- Split the byte stream into lines and glue each line's chunks back
 -- together into one strict ByteString (assumes PB.lines is a lens, as in pipes-bytestring 2.x).
 streamLines :: Monad m => Producer ByteString m r -> Producer ByteString m r
 streamLines p = L.purely folds L.mconcat (view PB.lines p)

 lineParser :: Monad m => Pipe ByteString LogEntry m r
 lineParser = P.map (parseOnly line_parser) >-> P.concat -- concat removes lefts

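(This is slightly laborious since pipes is sensibly persnickety about accumulating lines, and about memory generally: we are just trying to get a producer of individual strict bytestring lines, then convert that into a producer of parsed lines, and then throw out the bad parses, if there are any. With io-streams or conduit things will be basically the same, and that particular step will be easier.)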

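We are now in a position to fold over our Producer LogEntry IO (). This can be done explicitly using Pipes.Prelude.fold, which makes a strict left fold. Here we will just borrow the structure from user5402's answer:
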
 foldStream str = P.fold go initial_stats id str
  where
   go stats_till_now new_entry = undefined

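If you get used to the foldl library, and to applying a fold to a Producer with L.purely P.fold some_fold, then you can build Control.Foldl.Folds for your LogEntries out of components and slot in different requests as you please.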
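
As a rough illustration of building a Fold out of components, the sketch below combines three small folds applicatively so that they run in a single pass. The Entry type and the names countStatus and summary are hypothetical stand-ins for the real LogEntry and statistics, and the example assumes a version of the foldl package that provides L.prefilter.

```haskell
import qualified Control.Foldl as L

-- Hypothetical, cut-down log entry used only for this sketch.
data Entry = Entry
  { status :: Int  -- HTTP response code
  , bytes  :: Int  -- response size in bytes
  }

-- Count the entries whose status code lies in [lo, hi).
countStatus :: Int -> Int -> L.Fold Entry Int
countStatus lo hi = L.prefilter (\e -> status e >= lo && status e < hi) L.length

-- Three independent folds combined into one; the input is traversed once.
summary :: L.Fold Entry (Int, Int, Int)
summary = (,,) <$> countStatus 200 300
               <*> countStatus 400 600
               <*> L.premap bytes L.sum

main :: IO ()
main = print (L.fold summary [Entry 200 10, Entry 404 5, Entry 500 7, Entry 201 3])
```

Here L.fold runs the fold over a plain list. The same summary value should be usable on a Producer via L.purely P.fold summary, which is the pattern the paragraph above describes.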

If you use pipes-attoparsec and include the newline in your parser, then you can just write

 handleToLogEntries :: Handle -> Producer LogEntry IO ()
 handleToLogEntries h = void $ parsed my_line_parser (fromHandle h) >-> P.concat

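to get a Producer LogEntry IO () immediately. (However, this ultra-simple way of writing it will stop at a bad parse; also, dividing on lines first will be faster than using attoparsec to recognize newlines.) This is very simple with io-streams too; you would write something like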

import qualified System.IO.Streams as Streams

io :: Handle -> IO ()
io h = do  
    bytes <- Streams.handleToInputStream h
    log_entries <- Streams.parserToInputStream my_line_parser bytes
    fold_result <- Streams.fold go initial_stats log_entries
    print fold_result

or, to keep in line with the structure above:

 where 
  streamHandle = Streams.handleToInputStream
  streamProcess io_bytes = 
      io_bytes >>= Streams.parserToInputStream my_line_parser
  foldStream io_logentries =
      io_logentries >>= Streams.fold go initial_stats

Either way, my_line_parser should return a Maybe LogEntry and should recognize the newline.