I'm trying to write a parser using Parsec that will parse literate Haskell files, such as the following:
    The classic 'Hello, world' program.
    \begin{code}
    main = putStrLn "Hello, world"
    \end{code}
    More text.
I've written the following, inspired by the examples in RWH:
    import Text.ParserCombinators.Parsec

    main
        = do contents <- readFile "hello.lhs"
             let results = parseLiterate contents
             print results

    data Element
        = Text String
        | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input
        = parse literateFile "(unknown)" input

    literateFile
        = many codeOrProse

    codeOrProse
        = code <|> prose

    code
        = do eol
             string "\\begin{code}"
             eol
             content <- many anyChar
             eol
             string "\\end{code}"
             eol
             return $ Haskell content

    prose
        = do content <- many anyChar
             return $ Text content

    eol
        =   try (string "\n\r")
        <|> try (string "\r\n")
        <|> string "\n"
        <|> string "\r"
        <?> "end of line"
I was hoping that this would result in something like:

    [Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

(allowing for whitespace etc.).

This compiles fine, but when I run it, I get the error:

    *** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

Can anyone shed any light on this, and possibly help with a solution?
Answer 0 (score: 8)
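As ... pointed out, `many anyChar` is the problem. But not just in `prose`, also in `code`. The problem with `code` is that `content <- many anyChar` will consume everything: the newlines and the `\end{code}` tag.

So you need some way to tell prose and code apart. An easy (but maybe too naive) way to do so is to look for backslashes: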
    -- (<*) is from Control.Applicative (in the Prelude since base 4.8)
    literateFile = many codeOrProse <* eof

    code = do string "\\begin{code}"
              content <- many $ noneOf "\\"
              string "\\end{code}"
              return $ Haskell content

    prose = do content <- many1 $ noneOf "\\"
               return $ Text content
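Now, you don't quite get the desired result, because the `Haskell` part will also contain newlines, but you can filter these out quite easily (given a function `filterNewlines`, you could say `content <- filterNewlines <$> (many $ noneOf "\\")`).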
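The answer only names `filterNewlines` and does not define it. A minimal sketch of one possible definition follows; the behaviour chosen here (strip line breaks at both ends, keep the ones between lines of a multi-line code block) is an assumption, not something the answer specifies.

```haskell
import Data.List (dropWhileEnd)

-- Hypothetical helper: the answer mentions `filterNewlines` by name only.
-- This version strips leading and trailing line breaks and keeps the
-- ones in between, so multi-line code blocks stay readable.
filterNewlines :: String -> String
filterNewlines = dropWhileEnd isLineBreak . dropWhile isLineBreak
  where
    isLineBreak c = c == '\n' || c == '\r'

main :: IO ()
main = putStrLn (filterNewlines "\nmain = putStrLn \"Hello, world\"\n")
```

If every newline should go, `filter (/= '\n')` would do instead, at the cost of gluing the lines of a longer code block together.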
Edit

Okay, I think I found a solution (requires the newest Parsec version, because of `lookAhead`):
    import Text.ParserCombinators.Parsec
    import Control.Applicative hiding (many, (<|>))

    main
        = do contents <- readFile "hello.lhs"
             let results = parseLiterate contents
             print results

    data Element
        = Text String
        | Haskell String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input
        = parse literateFile "" input

    literateFile
        = many codeOrProse

    codeOrProse = code <|> prose

    code = do string "\\begin{code}\n"
              c <- untilP (string "\\end{code}\n")
              string "\\end{code}\n"
              return $ Haskell c

    prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
               return $ Text t

    -- Reads whole lines until the next line starts with something p accepts.
    -- The newline ending each line is consumed but not kept in the result.
    untilP p = do s <- many $ noneOf "\n"
                  newline
                  s' <- try (lookAhead p >> return "") <|> untilP p
                  return $ s ++ s'
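`untilP p` parses a line, then checks whether the beginning of the next line can be successfully parsed by `p`. If so, it returns the empty string; otherwise it goes on. The `lookAhead` is needed because otherwise the begin/end tags would be consumed and `code` couldn't recognize them.

I guess it could still be made more concise (i.e. not having to repeat `string "\\end{code}\n"` inside `code`).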
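One way to avoid repeating the end tag is Parsec's `manyTill`, which consumes its terminator itself. The sketch below is an alternative arrangement under stated assumptions, not the answer's own code: `line`, `beginTag` and `endTag` are names introduced here for illustration, and lines are joined with `"\n"` rather than concatenated directly as `untilP` does.

```haskell
import Data.List (intercalate)
import Text.ParserCombinators.Parsec

data Element = Text String | Haskell String
  deriving (Show, Eq)

-- One line of input, without its trailing newline.
line :: Parser String
line = do
  s <- many (noneOf "\n")
  _ <- newline
  return s

-- Hypothetical names for the two markers; `try` makes a partial
-- match (for example a line starting with "\b") backtrack cleanly.
beginTag, endTag :: Parser String
beginTag = try (string "\\begin{code}\n")
endTag   = try (string "\\end{code}\n")

-- manyTill consumes the terminator, so endTag is mentioned only once.
code :: Parser Element
code = do
  _  <- beginTag
  ls <- manyTill line endTag
  return (Haskell (intercalate "\n" ls))

-- Prose is every line up to, but not including, the next begin tag.
prose :: Parser Element
prose = do
  ls <- many1 (notFollowedBy beginTag >> line)
  return (Text (intercalate "\n" ls))

literateFile :: Parser [Element]
literateFile = do
  es <- many (code <|> prose)
  eof
  return es

main :: IO ()
main = do
  contents <- readFile "hello.lhs"
  print (parse literateFile "hello.lhs" contents)
```

Like `untilP`, this version works line by line, so it expects every line, including the last one, to end in a newline.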
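Answer 1 (score: 6)

I haven't tested it, but:

- `many anyChar` can match an empty string
- Therefore `prose` can match an empty string
- Therefore `codeOrProse` can match an empty string
- Therefore `literateFile` can loop forever, matching infinitely many empty strings

Changing `prose` to match `many1` characters might fix this problem.

(I'm not very familiar with Parsec, but how will `prose` know how many characters it should match? It might consume the whole input, never giving the `code` parser a second chance to look for the start of a new code segment. Alternatively, it might match only one character in each call, making the `many`/`many1` useless.)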
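The question in parentheses can be checked directly. The following small experiment is a sketch with a made-up `sample` input: it runs `many anyChar` and a bounded alternative, `many1 (noneOf "\\")`, on the same string so the two results can be compared.

```haskell
import Text.ParserCombinators.Parsec

-- Made-up input for illustration: one prose line, then a code block.
sample :: String
sample = "Some prose.\n\\begin{code}\nmain = return ()\n\\end{code}\n"

-- `many anyChar` never stops early: it takes the whole input,
-- including the \begin{code} and \end{code} markers.
greedy :: Either ParseError String
greedy = parse (many anyChar) "" sample

-- `many1 (noneOf "\\")` needs at least one character and stops
-- in front of the first backslash.
bounded :: Either ParseError String
bounded = parse (many1 (noneOf "\\")) "" sample

main :: IO ()
main = do
  print greedy
  print bounded
```

So `many anyChar` does consume the whole input, while the `noneOf` variant stops before the first backslash and leaves the rest for `code`.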
Answer 2 (score: 0)

For reference, this is another version I came up with (slightly extended to handle other cases):
    import Text.ParserCombinators.Parsec

    main
        = do contents <- readFile "test.tex"
             let results = parseLiterate contents
             print results

    data Element
        = Text String
        | Haskell String
        | Section String
        deriving (Show)

    parseLiterate :: String -> Either ParseError [Element]
    parseLiterate input
        = parse literateFile "(unknown)" input

    literateFile
        = do es <- many elements
             eof
             return es

    elements
        =   try section
        <|> try quotedBackslash
        <|> try code
        <|> prose

    code
        = do string "\\begin{code}"
             c <- anyChar `manyTill` try (string "\\end{code}")
             return $ Haskell c

    quotedBackslash
        = do string "\\\\"
             return $ Text "\\\\"

    prose
        = do t <- many1 (noneOf "\\")
             return $ Text t

    section
        = do string "\\section{"
             content <- many1 (noneOf "}")
             char '}'
             return $ Section content