I've written a log file parser using attoparsec. All of my smaller parsers succeed, and so does the combined final parser; I've confirmed this with tests. But I'm stumbling over performing operations with the parsed stream.
I first tried passing the successfully parsed input to a function, but all I seem to get back is Done (), which I take to mean the log file has been fully consumed at that point.
prepareStats :: Result Log -> IO ()
prepareStats r =
  case r of
    Fail _ _ _       -> putStrLn $ "Parsing failed"
    Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.

main :: IO ()
main = do
  [f] <- getArgs
  logFile <- B.readFile (f :: FilePath)
  let results = parseOnly parseLog logFile
  putStrLn "TBC"
I'd like to accumulate some statistics from the log file as I consume the input. For example, I'm parsing the response codes and I'd like to count how many 2** responses there were and how many 4**/5** responses. I'm parsing the number of bytes each response returned as an Int, and I'd like to sum those efficiently (sounds like foldl'?). I've defined a data type like this:
data Stats = Stats {
    successfulRequestsPerMinute :: Int
  , failingRequestsPerMinute    :: Int
  , meanResponseTime            :: Int
  , megabytesPerMinute          :: Int
  } deriving Show
And I'd like to keep updating that as I parse the input. But the part where I perform operations as I consume is where I'm stuck. So far, print is the only function I've successfully passed output to, and it showed the parse succeeded by returning Done before printing the output.
My main parser looks like this:
parseLogEntry :: Parser LogEntry
parseLogEntry = do
  ip <- logItem
  _ <- char ' '
  logName <- logItem
  _ <- char ' '
  user <- logItem
  _ <- char ' '
  time <- datetimeLogItem
  _ <- char ' '
  firstLogLine <- quotedLogItem
  _ <- char ' '
  finalRequestStatus <- intLogItem
  _ <- char ' '
  responseSizeB <- intLogItem
  _ <- char ' '
  timeToResponse <- intLogItem
  return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse
type Log = [LogEntry]
parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine
I'd like to pass each parsed line to a function that updates the above data type. Ideally I want this to be very memory-efficient, because it will be operating on large files.
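To make that concrete, here is a rough sketch of the kind of per-entry update I have in mind; the accessors finalRequestStatus and responseSizeB stand in for whatever my LogEntry record actually exposes, and the per-minute/mean bookkeeping is left out:

-- Rough sketch only: bump the 2** / 4**-5** counters and accumulate response
-- sizes; converting to per-minute rates and a mean is not handled here.
-- finalRequestStatus and responseSizeB are assumed record accessors on LogEntry.
updateStats :: Stats -> LogEntry -> Stats
updateStats s entry
  | code >= 200 && code < 300 =
      s { successfulRequestsPerMinute = successfulRequestsPerMinute s + 1
        , megabytesPerMinute          = megabytesPerMinute s + bytes }
  | code >= 400 =
      s { failingRequestsPerMinute = failingRequestsPerMinute s + 1
        , megabytesPerMinute       = megabytesPerMinute s + bytes }
  | otherwise = s
  where
    code  = finalRequestStatus entry
    bytes = responseSizeB entry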
Answer 0 (score 2)
You have to make your unit of parsing a single log entry rather than a list of log entries.
It's not pretty, but here is an example of how to interleave parsing and processing (it depends on bytestring, attoparsec, and mtl):
{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}

import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Control.Monad.State.Strict

aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum

getNext :: MonadState [a] m => m (Maybe a)
getNext = do
  xs <- get
  case xs of
    []     -> return Nothing
    (y:ys) -> put ys >> return (Just y)

loop iresult =
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword; loop (parse aWord x')
    Partial _     -> do
      mx <- getNext
      case mx of
        Just y  -> loop (feed iresult y)
        Nothing -> case feed iresult BS.empty of
                     Fail _ _ msg  -> error $ "parse failed: " ++ msg
                     Done x' aword -> do lift $ process aword; return ()
                     Partial _     -> error $ "partial returned" -- probably can't happen

process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w

theWords = map BS.pack ["this is a te", "st of the emergency ", "broadcasting sys", "tem"]

main = runStateT (loop (Partial (parse aWord))) theWords
Notes:
- It uses a simple word parser, aWord, and calls process after each word is recognized.
- Use feed to supply the parser with more input when it returns Partial.
- When Done is returned, process the recognized word and continue with parse aWord.
- getNext is just an example of a monadic function which gets the next unit of input. Replace it with your own version - e.g. something that reads the next line (or chunk) from a file.

Here is a solution using parseWith, as suggested by @dfeuer:
-- requires: import Data.Maybe (fromMaybe)

noMoreInput = fmap null get

loop2 x = do
  iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword
                        if BS.null x'
                          then do b <- noMoreInput
                                  if b then return ()
                                       else loop2 x'
                          else loop2 x'
    Partial _     -> error $ "huh???" -- this really can't happen

main2 = runStateT (loop2 BS.empty) theWords
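To make the note about getNext concrete: a minimal sketch of a chunk-reading replacement that works over a file Handle might look like the following (the name getNextFromHandle and the 4096-byte chunk size are illustrative assumptions, not part of the code above):

import qualified Data.ByteString.Char8 as BS
import System.IO (Handle)

-- Hypothetical stand-in for getNext: pull the next chunk straight from a
-- Handle; Nothing signals end of file.
getNextFromHandle :: Handle -> IO (Maybe BS.ByteString)
getNextFromHandle h = do
  chunk <- BS.hGetSome h 4096   -- read up to 4096 bytes; may return fewer
  return $ if BS.null chunk then Nothing else Just chunk

The driver loop would then run in plain IO with the Handle passed in (or in ReaderT Handle IO) instead of StateT [BS.ByteString] IO.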
Answer 1 (score 1)
If each log entry is exactly one line, here's a simpler solution:
do loglines <- fmap BS.lines $ BS.readFile "input-file.log"
   return $ foldl' go initialStats loglines
  where
    go stats logline =
      case parseOnly yourParser logline of
        Left e  -> error $ "oops: " ++ e
        Right r -> let stats' = ... combine r with stats ...
                   in stats'
Basically you are just reading the file line by line, calling parseOnly on each line, and accumulating the results.
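Filled in with the types from the question, this might look like the sketch below; initialStats and an updateStats :: Stats -> LogEntry -> Stats combining function are assumed to exist, and parse failures are skipped rather than treated as fatal:

import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8 (parseOnly)
import Data.List (foldl')

statsFromFile :: FilePath -> IO Stats
statsFromFile path = do
  loglines <- BS.lines <$> BS.readFile path
  -- strict left fold over the per-line parse results
  return $! foldl' step initialStats loglines
  where
    step stats line =
      case parseOnly parseLogEntry line of
        Left _      -> stats                  -- skip lines that fail to parse
        Right entry -> updateStats stats entry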
Answer 2 (score 1)
This is properly done with a streaming library:

main = do
  f:_ <- getArgs
  withFile f ReadMode $ \h -> do
    result <- foldStream $ streamProcess $ streamHandle h
    print result
  where
    streamHandle = undefined
    streamProcess = undefined
    foldStream = undefined
The blanks can be filled in with any streaming library, e.g.
import qualified Pipes.Prelude as P
import Pipes
import qualified Pipes.ByteString as PB
import Pipes.Group (folds)
import qualified Control.Foldl as L
import Control.Lens (view) -- or import Lens.Simple (view), or whatever
import qualified Data.ByteString as B
import Data.ByteString (ByteString)
import System.IO (Handle)

streamHandle = PB.fromHandle :: Handle -> Producer ByteString IO ()
In that case we might proceed to divide the labor further:
streamProcess :: Producer ByteString m r -> Producer LogEntry m r
streamProcess p = streamLines p >-> lineParser

streamLines :: Producer ByteString m r -> Producer ByteString m r
streamLines p = L.purely folds L.list (view (PB.lines p)) >-> P.map B.concat

lineParser :: Pipe ByteString LogEntry m r
lineParser = P.map (parseOnly line_parser) >-> P.concat -- concat removes lefts
(This is slightly laborious because pipes is sensibly wary about accumulating lines, and about memory generally: we are just trying to get a producer of individual strict bytestring lines, then turn it into a producer of parsed lines, and then throw out any bad parses, if there are any. With io-streams or conduit things would be basically the same, and that particular step would be easier.)
We are now in a position to fold over our Producer LogEntry IO (). This can be done explicitly with Pipes.Prelude.fold, which makes a strict left fold. Here we will just borrow the structure from user5402's answer:
foldStream str = P.fold go initial_stats id str
  where
    go stats_till_now new_entry = undefined
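For illustration, assuming an initialStats value and an updateStats :: Stats -> LogEntry -> Stats combining function like the ones sketched elsewhere in this thread, those undefineds could be filled in roughly like this:

-- Strict left fold of a stream of log entries into the question's Stats type.
foldStream :: Monad m => Producer LogEntry m () -> m Stats
foldStream entries = P.fold updateStats initialStats id entries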
If you are accustomed to the foldl library and to applying a fold to a Producer with L.purely P.fold some_fold, then you can build Control.Foldl.Folds for your LogEntries out of components and slot in different requests as the need arises.
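As a sketch of that style (assuming hypothetical accessors isSuccess :: LogEntry -> Bool and entryBytes :: LogEntry -> Int in place of the question's real field names), the individual statistics can be written as small Folds and combined applicatively:

import qualified Control.Foldl as L

-- Three statistics computed in a single pass by combining Folds applicatively:
-- count of successful entries, count of the rest, and total bytes.
statsFold :: L.Fold LogEntry (Int, Int, Int)
statsFold =
  (,,) <$> L.prefilter isSuccess L.length
       <*> L.prefilter (not . isSuccess) L.length
       <*> L.premap entryBytes L.sum

It would then be run over the producer with L.purely P.fold statsFold log_entries.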
If you use pipes-attoparsec and include the newline in your parser, then you can just write

handleToLogEntries :: Handle -> Producer LogEntry IO ()
handleToLogEntries h = void $ parsed my_line_parser (fromHandle h) >-> P.concat

and get the Producer LogEntry IO () directly. (This super-simple way of writing it will, however, stop at a bad parse; splitting into lines first will also be faster than using attoparsec to recognize newlines.) This is all very simple with io-streams too; you would write something like
import qualified System.IO.Streams as Streams

io :: Handle -> IO ()
io h = do
  bytes <- Streams.handleToInputStream h
  log_entries <- Streams.parserToInputStream my_line_parser bytes
  fold_result <- Streams.fold go initial_stats log_entries
  print fold_result
or, to keep in line with the structure above:
  where
    streamHandle = Streams.handleToInputStream
    streamProcess io_bytes =
      io_bytes >>= Streams.parserToInputStream my_line_parser
    foldStream io_logentries =
      io_logentries >>= Streams.fold go initial_stats
Either way, my_line_parser should return a Maybe LogEntry and should recognize the newline.
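A minimal sketch of such a parser, reusing parseLogEntry from the question (the endOfInput branch returning Nothing is the usual way to signal end-of-stream to parserToInputStream):

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8 (Parser, endOfInput, endOfLine)

-- Nothing at end of input; otherwise one log entry including its newline.
my_line_parser :: Parser (Maybe LogEntry)
my_line_parser = (endOfInput >> pure Nothing)
             <|> (Just <$> (parseLogEntry <* endOfLine))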