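I'm trying to parse a Wikipedia XML dump with the Haskell Parsec library to find certain links on each page. Links are delimited by double square brackets: texttext[[link]]texttext. To keep the scenario as simple as possible, say I'm looking for the first link that is not enclosed in double curly braces (which can be nested): {{ {{ [[Wrong Link]] }} [[Wrong Link]] }} [[Right Link]]. I've written a parser that discards links enclosed in non-nested double braces: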
import Text.Parsec

getLink :: String -> Either ParseError String
getLink = parse linkParser "Links"

linkParser = do
    beforeLink
    link <- many $ noneOf "]"
    string "]]"
    return link

beforeLink = manyTill (many notLink) (try $ string "[[")

notLink = try doubleCurlyBrac <|> (many1 normalText)

normalText = noneOf "[{"
         <|> notFollowedByItself '['
         <|> notFollowedByItself '{'

notFollowedByItself c = try ( do x <- char c
                                 notFollowedBy $ char c
                                 return x)

doubleCurlyBrac = between (string "{{") (string "}}") (many $ noneOf "}")

getLinkTest = fmap getLink testList
    where testList = [" [[rightLink]] " --Correct link is found
                     , " {{ [[Wrong_Link]] }} [[rightLink]]" --Correct link is found
                     , " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]" ] --Wrong link is found
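I tried making the doubleCurlyBrac parser recursive so that it also discards links in nested curly braces, without success:

doubleCurlyBrac = between (string "{{") (string "}}") betweenBraces
    where betweenBraces = doubleCurlyBrac <|> (many $ try $ noneOf "}")

In the nested example, this parser stops consuming input after the first }} rather than the last. Is there an elegant way to write a recursive parser that, in this case, correctly ignores links in nested double curly braces? Also, can it be done without try? I've found that, because try does not consume input when it fails, it often makes the parser hang on unexpected, ill-formed input.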
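Answer 0: (score: 2)

Here is a more direct version that doesn't use a custom lexer. It does use try, and I don't see how to avoid it. The problem is that we seem to need non-committing look-ahead to tell a double bracket from a single one, and try is what provides non-committing look-ahead.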
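
To illustrate the commitment issue, here is a minimal sketch. The names run, committed and backtracking are made up for this example and are not part of the answer's code. Parsec's string consumes the characters it has already matched before it fails, and <|> only tries its right-hand alternative when the left one failed without consuming input; wrapping the left side in try rewinds that consumption.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: drop the error and keep only the result.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p "<example>"

-- Without try: on "{x", string "{{" consumes '{' before failing,
-- so (<|>) never tries the second alternative.
committed :: Parser String
committed = string "{{" <|> string "{x"

-- With try: the failed attempt is rewound, so the alternative runs.
backtracking :: Parser String
backtracking = try (string "{{") <|> string "{x"

main :: IO ()
main = do
  print (run committed "{x")     -- Nothing
  print (run backtracking "{x")  -- Just "{x"
```

This is the same reason the token parsers in the listing that follows are each wrapped in try.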
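
The high-level approach is the same as in my first answer. I've been careful to make the three node parsers commute, which makes the code more robust to change, by using both try and notFollowedBy: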

{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (string)
import qualified Text.Parsec
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)
import Debug.Trace

----------------------------------------------------------------------
-- Token parsers.

llink , rlink , lbrace , rbrace :: Parsec String u String
[llink , rlink , lbrace , rbrace] = reserved
reserved = map (try . Text.Parsec.string) ["[[" , "]]" , "{{" , "}}"]

----------------------------------------------------------------------
-- Node parsers.

-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

nodes :: Parsec String u [Node]
nodes = many node

node :: Parsec String u Node
node = link <|> braces <|> string

link , braces , string :: Parsec String u Node
link   = L <$> between llink rlink nodes
braces = B <$> between lbrace rbrace nodes
string = S <$> many1 (notFollowedBy (choice reserved) >> anyChar)

----------------------------------------------------------------------

parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>"

----------------------------------------------------------------------
-- Tests.

getLink :: [Node] -> Maybe Node
getLink = find isLink where
  isLink (L _) = True
  isLink _     = False

parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes

testList = [ " [[rightLink]] "
           , " {{ [[Wrong_Link]] }} [[rightLink]]"
           , " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
           , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
           -- Pathological example from comments.
           , "{{ab}cd}}"
           -- A more pathological example.
           , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
           -- No top level link.
           , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
           -- Too many '{{'.
           , "{{ {{ {{ [[ asdf ]] }} }}"
           -- Too many '}}'.
           , "{{ {{ [[ asdf ]] }} }} }}"
           -- Too many '[['.
           , "[[ {{ [[{{[[asdf]]}}]]}}"
           ]

main =
  forM_ testList $ \ t -> do
    putStrLn $ "Test: ^" ++ t ++ "$"
    let parses = ( , ) <$> parseNodes t <*> parseLink t
        printParses (n , l) = do
          putStrLn $ "Nodes: " ++ show n
          putStrLn $ "Link: " ++ show l
        printError = putStrLn . show
    either printError printParses parses
    putStrLn ""

The output is the same in the non-error cases:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
but the parse error messages are uninformative in the case of unmatched openings:
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 26):
unexpected "}}"
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting "[[", "{{", "]]" or "}}"
I couldn't figure out how to fix them.
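
For readers who want to poke at the plain-text node parser from the listing above on its own, here is a small sketch of that one idiom in isolation. The names run and plainText are made up for this example (plainText stands in for the answer's string, so that Parsec's own string does not need to be hidden). Each character is accepted only if no reserved delimiter starts at that position.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: drop the error and keep only the result.
run :: Parser a -> String -> Maybe a
run p = either (const Nothing) Just . parse p "<example>"

-- The four reserved delimiters, each wrapped in try as in the answer.
reserved :: [Parser String]
reserved = map (try . string) ["[[", "]]", "{{", "}}"]

-- Plain text: one or more characters, none of which starts a delimiter.
plainText :: Parser String
plainText = many1 (notFollowedBy (choice reserved) >> anyChar)

main :: IO ()
main = do
  print (run plainText "ab}cd}}")  -- Just "ab}cd"
  print (run plainText "{{x")      -- Nothing
```

A single } or { is ordinary text here, which is why the test input {{ab}cd}} produces the node S "ab}cd".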
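Answer 1: (score: 1)

My solution doesn't use try, but it is relatively complicated: I used your question as an excuse to learn how to create a lexer in Parsec without using makeTokenParser :D I avoid try because the only look-ahead happens in the lexer (tokenize), where the various bracket pairs are recognized.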
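
To make the lexing step concrete, here is the pairing rule on its own. It is the same tokenize as in the full listing further down, rewritten with guards, and with Show derived (an assumption made for this sketch only; the full listing uses a custom Show instance) so that the constructors are visible in the output.

```haskell
-- Character or punctuation token.
data Token = C Char | P String deriving (Eq, Show)

-- Scan left to right; whenever the next two characters form a
-- delimiter, emit one punctuation token, otherwise emit one character.
tokenize :: String -> [Token]
tokenize []         = []
tokenize [c]        = [C c]
tokenize (c1:c2:cs)
  | [c1,c2] `elem` ["[[", "]]", "{{", "}}"] = P [c1,c2] : tokenize cs
  | otherwise                               = C c1 : tokenize (c2:cs)

main :: IO ()
main = do
  print (tokenize "{{ab}cd}}")
  -- [P "{{",C 'a',C 'b',C '}',C 'c',C 'd',P "}}"]
  print (tokenize "[[[a]]]")
  -- [P "[[",C '[',C 'a',P "]]",C ']']
```

Because pairing is greedy from the left, three opening brackets lex as one [[ token followed by an ordinary [ character.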
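
The high-level idea is that we treat {{, }}, [[, and ]] as special tokens and parse the input into an AST. You didn't specify an exact grammar, so I chose a simple one that generates your examples:

node ::= '{{' node* '}}'
       | '[[' node* ']]'
       | string
string ::= <non-empty string without '{{', '}}', '[[', or ']]'>
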
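I parse the input string into a list of nodes. The first top-level link ([[) node, if any, is the link you're looking for.

The approach I've taken should be relatively robust to changes in the grammar. For example, if you only want to allow strings inside links, change '[[' node* ']]' to '[[' string ']]'. (In the code,

link = L <$> between llink rlink nodes

becomes

link = L . (:[]) <$> between llink rlink string

where (:[]) wraps the single string node in a list, since L expects [Node].)

The code is fairly long, but most of it is straightforward. Most of it is concerned with creating the token stream (lexing) and parsing individual tokens. After that, the actual Node parsing is very simple.

Here it is: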

{-# LANGUAGE TupleSections #-}
import Text.Parsec hiding (char , string)
import Text.Parsec.Pos (updatePosString , updatePosChar)
import Control.Applicative ((<$>) , (<*) , (<*>))
import Control.Monad (forM_)
import Data.List (find)

----------------------------------------------------------------------
-- Lexing.

-- Character or punctuation.
data Token = C Char | P String deriving Eq

instance Show Token where
  show (C c) = [c]
  show (P s) = s

tokenize :: String -> [Token]
tokenize [] = []
tokenize [c] = [C c]
tokenize (c1:c2:cs) = case [c1,c2] of
  "[[" -> ts
  "]]" -> ts
  "{{" -> ts
  "}}" -> ts
  _    -> C c1 : tokenize (c2:cs)
  where
    ts = P [c1,c2] : tokenize cs

----------------------------------------------------------------------
-- Token parsers.

-- We update the 'sourcePos' while parsing the tokens. Alternatively,
-- we could have annotated the tokens with positions above in
-- 'tokenize', and then here we would use 'token' instead of
-- 'tokenPrim'.
llink , rlink , lbrace , rbrace :: Parsec [Token] u Token
[llink , rlink , lbrace , rbrace] =
  map (t . P) ["[[" , "]]" , "{{" , "}}"]
  where
    t x = tokenPrim show update match where
      match y = if x == y then Just x else Nothing
      update pos (P s) _ = updatePosString pos s

char :: Parsec [Token] u Char
char = tokenPrim show update match where
  match (C c) = Just c
  match (P _) = Nothing
  update pos (C c) _ = updatePosChar pos c

----------------------------------------------------------------------
-- Node parsers.

-- Link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

nodes :: Parsec [Token] u [Node]
nodes = many node

node :: Parsec [Token] u Node
node = link <|> braces <|> string

link , braces , string :: Parsec [Token] u Node
link   = L <$> between llink (rlink <?> "]]") nodes
braces = B <$> between lbrace (rbrace <?> "}}") nodes
string = S <$> many1 char

----------------------------------------------------------------------

parseNodes :: String -> Either ParseError [Node]
parseNodes = parse (nodes <* eof) "<no file>" . tokenize

----------------------------------------------------------------------
-- Tests.

getLink :: [Node] -> Maybe Node
getLink = find isLink where
  isLink (L _) = True
  isLink _     = False

parseLink :: String -> Either ParseError (Maybe Node)
parseLink = either Left (Right . getLink) . parseNodes

testList = [ " [[rightLink]] "
           , " {{ [[Wrong_Link]] }} [[rightLink]]"
           , " {{ {{ }} [[Wrong_Link]] }} [[rightLink]]"
           , " [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}"
           -- Pathological example from comments.
           , "{{ab}cd}}"
           -- A more pathological example.
           , "{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf"
           -- No top level link.
           , "{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}"
           -- Too many '{{'.
           , "{{ {{ {{ [[ asdf ]] }} }}"
           -- Too many '}}'.
           , "{{ {{ [[ asdf ]] }} }} }}"
           -- Too many '[['.
           , "[[ {{ [[{{[[asdf]]}}]]}}"
           ]

main =
  forM_ testList $ \ t -> do
    putStrLn $ "Test: ^" ++ t ++ "$"
    let parses = ( , ) <$> parseNodes t <*> parseLink t
        printParses (n , l) = do
          putStrLn $ "Nodes: " ++ show n
          putStrLn $ "Link: " ++ show l
        printError = putStrLn . show
    either printError printParses parses
    putStrLn ""

The output of main is:
Test: ^ [[rightLink]] $
Nodes: [S " ",L [S "rightLink"],S " "]
Link: Just (L [S "rightLink"])
Test: ^ {{ [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ {{ {{ }} [[Wrong_Link]] }} [[rightLink]]$
Nodes: [S " ",B [S " ",B [S " "],S " ",L [S "Wrong_Link"],S " "],S " ",L [S "rightLink"]]
Link: Just (L [S "rightLink"])
Test: ^ [[{{[[someLink]]}}]] {{}} {{[[asdf]]}}$
Nodes: [S " ",L [B [L [S "someLink"]]],S " ",B [],S " ",B [L [S "asdf"]]]
Link: Just (L [B [L [S "someLink"]]])
Test: ^{{ab}cd}}$
Nodes: [B [S "ab}cd"]]
Link: Nothing
Test: ^{ [ { {asf{[[[asdfa]]]}aasdff ] ] ] {{[[asdf]]}}asdf$
Nodes: [S "{ [ { {asf{",L [S "[asdfa"],S "]}aasdff ] ] ] ",B [L [S "asdf"]],S "asdf"]
Link: Just (L [S "[asdfa"])
Test: ^{{[[Wrong_Link]]asdf[[WRong_Link]]{{}}}}{{[[[[Wrong]]]]}}$
Nodes: [B [L [S "Wrong_Link"],S "asdf",L [S "WRong_Link"],B []],B [L [L [S "Wrong"]]]]
Link: Nothing
Test: ^{{ {{ {{ [[ asdf ]] }} }}$
"<no file>" (line 1, column 26):
unexpected end of input
expecting }}
Test: ^{{ {{ [[ asdf ]] }} }} }}$
"<no file>" (line 1, column 24):
unexpected }}
expecting end of input
Test: ^[[ {{ [[{{[[asdf]]}}]]}}$
"<no file>" (line 1, column 25):
unexpected end of input
expecting ]]
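
The question asked for the link as a String, whereas getLink returns a Node. As a rough sketch of how one might bridge that gap, the helpers below (nodeText and firstLinkText are hypothetical names, not part of either answer) flatten the first top-level link into plain text. Dropping the contents of nested {{ }} inside a link is an assumption about what is wanted, not something the question specifies.

```haskell
import Data.List (find)

-- Same Node type as in the answers: link, braces, or string.
data Node = L [Node] | B [Node] | S String deriving Show

-- Hypothetical helper: the plain text of a node, with anything
-- inside nested {{ }} dropped (an assumption, see above).
nodeText :: Node -> String
nodeText (S s)  = s
nodeText (L ns) = concatMap nodeText ns
nodeText (B _)  = ""

-- Hypothetical helper: text of the first top-level link, if any.
firstLinkText :: [Node] -> Maybe String
firstLinkText ns = nodeText <$> find isLink ns
  where
    isLink (L _) = True
    isLink _     = False

main :: IO ()
main = do
  print (firstLinkText [S " ", B [L [S "Wrong_Link"]], S " ", L [S "rightLink"]])
  -- Just "rightLink"
  print (firstLinkText [B [L [S "Wrong_Link"]]])
  -- Nothing
```

With either answer's parseNodes, fmap firstLinkText . parseNodes would then have the type String -> Either ParseError (Maybe String).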