Parsec - error "combinator 'many' is applied to a parser that accepts an empty string"

Time: 2011-10-13 12:16:29

Tags: haskell parsec

I'm trying to use Parsec to write a parser for literate Haskell files such as the following:

The classic 'Hello, world' program.

\begin{code}

main = putStrLn "Hello, world"

\end{code}

More text.

I wrote the following, inspired by the examples in RWH:

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "hello.lhs"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    deriving (Show)


parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input



literateFile
    = many codeOrProse

codeOrProse
    = code <|> prose

code
    = do eol
         string "\\begin{code}"
         eol
         content <- many anyChar
         eol
         string "\\end{code}"
         eol
         return $ Haskell content

prose
    = do content <- many anyChar
         return $ Text content

eol
    =   try (string "\n\r")
    <|> try (string "\r\n")
    <|> string "\n"
    <|> string "\r"
    <?> "end of line"

I was hoping this would produce the following:

[Text "The classic 'Hello, world' program.", Haskell "main = putStrLn \"Hello, world\"", Text "More text."]

(allowing for whitespace, etc.).

This compiles fine, but when I run it I get the error:

*** Exception: Text.ParserCombinators.Parsec.Prim.many: combinator 'many' is applied to a parser that accepts an empty string

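Can anyone shed some light on this, and perhaps help with a solution?

3 answers:

Answer 0 (score: 8)

As ... pointed out, many anyChar is the problem. But not just in prose, also in code. The problem with code is that content <- many anyChar will consume everything: the newlines and the \end{code} tag as well.

So you need some way to tell prose and code apart. One simple (but perhaps too naive) way is to look for backslashes: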

literateFile = many codeOrProse <* eof

code = do string "\\begin{code}"
          content <- many $ noneOf "\\"
          string "\\end{code}"
          return $ Haskell content

prose = do content <- many1 $ noneOf "\\"
           return $ Text content

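Now, you don't get exactly the desired result, because the Haskell parts will also contain newlines, but you can filter those out quite easily (given a function filterNewlines, you could say content <- filterNewlines <$> (many $ noneOf "\\")).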
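The naive backslash approach and the newline filtering can be put together as a minimal sketch. The answer only assumes that filterNewlines exists; the definition here (dropping every CR and LF character) is a hypothetical one chosen for illustration, as are the type signatures and the "(sketch)" source name.

```haskell
import Text.ParserCombinators.Parsec

data Element
    = Text String
    | Haskell String
    deriving (Show)

-- Hypothetical helper (not defined in the answer): removes every
-- carriage return and line feed, including ones inside the code body.
filterNewlines :: String -> String
filterNewlines = filter (`notElem` "\r\n")

-- A code block: everything between the two tags that is not a backslash.
code :: Parser Element
code = do
    _ <- string "\\begin{code}"
    content <- filterNewlines <$> many (noneOf "\\")
    _ <- string "\\end{code}"
    return (Haskell content)

-- Prose must consume at least one character, so 'many' cannot loop on it.
prose :: Parser Element
prose = Text <$> many1 (noneOf "\\")

literateFile :: Parser [Element]
literateFile = many (code <|> prose) <* eof

parseLiterate :: String -> Either ParseError [Element]
parseLiterate = parse literateFile "(sketch)"
```

Note that this sketch inherits the limitation of the naive approach: any backslash inside prose or code that is not part of a begin/end tag makes the parse fail.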

Edit

OK, I think I found a solution (it requires a recent Parsec version, because of lookAhead):

import Text.ParserCombinators.Parsec
import Control.Applicative hiding (many, (<|>))

main = do contents <- readFile "hello.lhs"
          let results = parseLiterate contents
          print results

data Element
    = Text String
    | Haskell String
    deriving (Show)

parseLiterate :: String -> Either ParseError [Element]
parseLiterate input = parse literateFile "" input

literateFile = many codeOrProse

codeOrProse = code <|> prose

code = do string "\\begin{code}\n"
          c <- untilP (string "\\end{code}\n")
          string "\\end{code}\n"
          return $ Haskell c

prose = do t <- untilP $ (string "\\begin{code}\n") <|> (eof >> return "")
           return $ Text t

untilP p = do s <- many $ noneOf "\n"
              newline
              s' <- try (lookAhead p >> return "") <|> untilP p
              return $ s ++ s'

untilP p parses a line, then checks whether p can successfully parse the beginning of the next line. If so, it returns the empty string, otherwise it continues. The lookAhead is needed because otherwise the begin/end tags would be consumed and code couldn't recognize them.

I suppose it could still be made more concise (i.e. without having to repeat string "\\end{code}\n" inside code).

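Answer 1 (score: 6)

I haven't tested this, but:

  • many anyChar can match the empty string
  • so prose can match the empty string
  • so codeOrProse can match the empty string
  • so literateFile can loop forever, matching infinitely many empty strings

Changing prose to match many1 characters might fix this.

(I'm not very familiar with Parsec, but how will prose know how many characters it should match? It might consume the whole input, never giving the code parser a second chance to look for the start of a new code segment. Or it might match only one character per call, making the many / many1 in it useless.)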
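The concern in the parenthetical can be checked with a small sketch. The names proseGreedy and parseAll are hypothetical and only serve the illustration. Switching to many1 does stop the "accepts an empty string" exception, because the parser can no longer succeed without consuming input, but many1 anyChar still swallows the rest of the input, \begin{code} tags included.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical greedy prose parser: at least one character, then
-- everything else up to the end of the input.
proseGreedy :: Parser String
proseGreedy = many1 anyChar

-- 'many' is safe here: proseGreedy fails (instead of succeeding on
-- nothing) once the input is exhausted, so the loop terminates.
parseAll :: String -> Either ParseError [String]
parseAll = parse (many proseGreedy <* eof) "(sketch)"
```

For an input such as "text\\begin{code}x", parseAll yields a single chunk holding the whole string, so a code parser placed after prose in an alternative would never get to see the tag.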

Answer 2 (score: 0)

For reference, here's another version I came up with (slightly extended to handle other cases):

import Text.ParserCombinators.Parsec

main
    = do contents <- readFile "test.tex"
         let results = parseLiterate contents
         print results

data Element
    = Text String
    | Haskell String
    | Section String
    deriving (Show)

parseLiterate :: String -> Either ParseError [Element]

parseLiterate input
    = parse literateFile "(unknown)" input

literateFile
    = do es <- many elements
         eof
         return es

elements
    = try section
  <|> try quotedBackslash
  <|> try code
  <|> prose

code
    = do string "\\begin{code}"
         c <- anyChar `manyTill` try (string "\\end{code}")
         return $ Haskell c

quotedBackslash
    = do string "\\\\"
         return $ Text "\\\\"

prose
    = do t <- many1 (noneOf "\\")
         return $ Text t

section
    = do string "\\section{"
         content <- many1 (noneOf "}")
         char '}'
         return $ Section content
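The role of try around the end tag in the code parser above can be seen in isolation. This is a minimal sketch with a hypothetical parser name, codeBlock: string "\\end{code}" consumes the backslash before it notices a mismatch, and without try that consumed failure would abort manyTill as soon as the code body contains any other backslash.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical stand-alone version of the code parser: collects every
-- character up to the end tag. 'try' makes a partial match of the end
-- tag backtrack, so a lone backslash is kept as ordinary content.
codeBlock :: Parser String
codeBlock = do
    _ <- string "\\begin{code}"
    anyChar `manyTill` try (string "\\end{code}")
```

Unlike the noneOf "\\" approach, this accepts backslashes inside the code body, at the cost of re-trying the end tag at every character.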