Here is a sample of the file I am trying to parse:
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 7
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 123456789 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 1A23B45C79
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12345 1234567890001 AB ABCDE FGHI PRODUCTS 20140314 20140914 059
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 8
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 234567890 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 5F7A657G87
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12346 2345678901 AB ABCDE FGHI PRODUCTS 20140129 20140729 059
12346 3456789012 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 9
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 345678901 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 6K75L8791L
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12346 4567890123 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
12346 4567890123 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
NUMBER OF SETS REJECTED ARE : 13 TOTAL SETS IN BATCH: 16,940
*** END OF REPORT ***
Here is a series of snippets from my module:
module XX00135 (parseFile) where

import Control.Applicative ((<$>), (<*>), (<*))
import Text.ParserCombinators.Parsec hiding (Line)

data Line = Line { code    :: String
                 , account :: String
                 , aType   :: String
                 , company :: String
                 , begDate :: String
                 , endDate :: String
                 , errCode :: String }

data Page = Page { periodBeginning :: String
                 , periodEnd       :: String
                 , reportDate      :: String
                 , batch           :: String
                 , contrib         :: String
                 , quoteBack       :: String
                 , lineList        :: [Line] }

data Report = Report { pages :: [Page] }

parseReportDate :: Parser String
parseReportDate =
  manyTill anyChar (string "CURRENT COMPANY") >> spaces >> count 10 anyChar

headers :: Parser String
headers =
  choice [ try (string "\n")
         , try (string "CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR")
         , try (string "------ -------------------- --- -------------------- -------- -------- ---") ]

line :: Parser Line
line =
  Line <$> count 6 anyChar <* space
       <*> count 20 anyChar <* space
       <*> count 3 anyChar <* space
       <*> count 20 anyChar <* space
       <*> count 8 anyChar <* space
       <*> count 8 anyChar <* space
       <*> count 3 anyChar <* newline

page :: Parser Page
page =
  Page <$> (manyTill anyChar (string "BEG PER:") >> space >> count 10 anyChar)
       <*> parseReportDate
       <*> (manyTill anyChar (string "END PER:") >> space >> count 10 anyChar)
       <*> (manyTill anyChar (string "BATCH:") >> space >> count 9 anyChar)
       <*> (space >> string "CONTRIB:" >> space >> count 9 anyChar)
       <*> (manyTill anyChar (string "QUOTE BACK:") >> space >> count 10 anyChar
            <* skipMany1 headers)
       <*> (manyTill line (twoNewLines <|> footer))

report :: Parser Report
report = Report <$> manyTill page (try footer)

twoNewLines :: Parser ()
twoNewLines = (count 2 newline) >> return ()

footer :: Parser ()
footer = (space >> string "NUMBER OF SETS REJECTED" >> manyTill anyChar (string "*** END OF REPORT ***") >> optional eof) >> return ()

parseFile :: [(String, String)] -> String -> String
parseFile errors text =
  let rs = case parse (manyTill report eof) "" text of
  ...
The full file has 115 lines. When I cat the file and pipe it into my Haskell program, I get:
(line 116, column 1);
unexpected end of input
expecting "BEG PER:"
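Originally I simply discarded the footer and everything after it. But my full use case is to cat multiple files and pipe them into my Haskell program, which means I cannot throw away the footer and whatever follows. My problems began once I tried to skip over the footer instead of discarding it. It is probably something simple, and I am just confused and overlooking the obvious.
Let me know if you need more code. I do some transformations after the parse, and I did not want to clutter the code with unnecessary detail.
Thanks!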
Answer 0 (score: 1)
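I have solved the problem. The code is a bit different now, and I am not sure exactly which change fixed it; I spent a lot of time staring at the code and making small changes here and there. I think, though, that it had to do with cat appending a newline to the file. So I changed footer: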
footer = space >> string "NUMBER OF SETS REJECTED"
  >> anyChar `manyTill` (string "*** END OF REPORT ***") >> newline >> string ""
The footer now consumes the extra newline at the end of the file and returns a string. I use footer in eop (end of page):
eop =
  choice [ count 2 newline
         , footer ]
I use eop in the last line of page:
<*> line `manyTill` eop
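One thing to watch with eop, offered as a hedged aside rather than a diagnosis of the code above: Parsec's choice (like <|>) only tries the next alternative when the previous one fails without consuming input. If count 2 newline consumes one newline and then fails, footer is never attempted unless the first alternative is wrapped in try. The sketch below uses a toy grammar; blankLine, endMark, eopNoTry, eopTry, and run are hypothetical names, and the "END" marker merely stands in for the real footer.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Two newlines in a row (a blank-line separator between pages).
blankLine :: Parser String
blankLine = count 2 newline

-- Toy stand-in for the footer: a newline followed by "END".
endMark :: Parser String
endMark = newline >> string "END"

-- Without try: blankLine may consume one newline before failing,
-- and then choice does not fall through to endMark.
eopNoTry :: Parser String
eopNoTry = choice [blankLine, endMark]

-- With try: a failed blankLine rewinds the input, so endMark gets a turn.
eopTry :: Parser String
eopTry = choice [try blankLine, endMark]

-- Helper: run a parser and discard the error details.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""
```

On the input "\nEND", eopNoTry fails while eopTry succeeds. Whether this matters for the real report depends on how many newlines actually precede the footer, which the sample above does not show.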
report is now:
report = count 2 newline >> Report <$> many page
I also changed page. I think it was consuming input with anyChar in unexpected ways, so now I throw away the first line of each page:
page = firstLine >>
  Page <$> (string "BEG PER:" >> space >> count 10 anyChar)
  ...
firstLine =
  string "XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO :"
  >> spaces >> many digit >> newline
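A small sketch of why anchoring the start of page may have mattered. This is an assumption about the cause, not something the changes above confirm: manyTill anyChar (string "BEG PER:") consumes input while it searches, so when it is attempted on a leftover trailing newline it fails after consuming, and a surrounding many reports an error instead of stopping. A parser that must match at the current position fails without consuming, so many ends cleanly. skipToBeg, anchoredBeg, and run are hypothetical names.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Searches forward for the marker, consuming whatever precedes it.
skipToBeg :: Parser String
skipToBeg = manyTill anyChar (string "BEG PER:") >> space >> count 10 anyChar

-- Requires the marker at the current position.
anchoredBeg :: Parser String
anchoredBeg = string "BEG PER:" >> space >> count 10 anyChar

-- Helper: run a parser and discard the error details.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""
```

With the input "BEG PER: 03/17/2014\n", many skipToBeg fails at end of input, much like the "unexpected end of input, expecting "BEG PER:"" message in the question, whereas many anchoredBeg returns the single date and leaves the newline for a later parser.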
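I think that covers all the important changes I made to get the parse to succeed. It now parses a single file from the cat command, as well as multiple files concatenated by cat. Hooray! I love Haskell.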
Answer 1 (score: 0)
It looks like page consumes the footer:
<*> (manyTill line (twoNewLines <|> footer))
So report never gets to consume the footer:
report = Report <$> manyTill page (try footer)
Perhaps you need 'sepBy' to recognize the 'twoNewLines' between your 'page's (and without the final 'manyTill').
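One possible arrangement along these lines, as a minimal sketch on a toy format rather than the real report: each page stops in front of the footer instead of consuming it, so the enclosing report parser is the only place the footer is matched, and several concatenated reports can be read in a row. The "PAGE" and "END" markers and the names row, page, footer, report, reports, and run are simplified, hypothetical stand-ins for the parsers in the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A data row: digits followed by a newline.
row :: Parser String
row = many1 digit <* newline

-- A page: a "PAGE" header line, then rows. `many row` stops at the
-- first line that is not a row, leaving the footer unconsumed.
page :: Parser [String]
page = string "PAGE" *> newline *> many row

-- The footer line, matched only by `report`.
footer :: Parser ()
footer = string "END" *> newline *> pure ()

-- A report: pages up to (and including) the footer.
report :: Parser [[String]]
report = manyTill page (try footer)

-- Several reports back to back, as produced by concatenating files.
reports :: Parser [[[String]]]
reports = many1 report <* eof

-- Helper: run a parser and discard the error details.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p ""
```

For example, run reports "PAGE\n1\n2\nPAGE\n3\nEND\nPAGE\n4\nEND\n" yields two reports, the first with two pages and the second with one. A 'sepBy'-based version would follow the same idea, with the page separator taking the place of the lookahead.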