I'm building an ANTLR4 grammar to parse a custom language that looks like this:
start rule_set {
/foo/bar {
//some_rules
}
}
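/foo/bar is a URL-like path, so it may contain percent-escapes (for example %20) and other symbols. The rule_set part, however, is a plain identifier, and % must not appear in it.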
Here is my grammar so far:
grammar TEST;
start: 'start' IDENTIFIER block EOF;
block: LBRACE matcher* RBRACE;
matcher: matchPath matchBlock;
matchBlock: LBRACE RULES RBRACE;
matchPath: ('/' pathSegment)+;
pathSegment: (PATH_CHAR)+;
LBRACE: '{';
RBRACE: '}';
RULES: '//some_rules';
fragment LETTER : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT : '0'..'9' ;
fragment URLHEX: ('%' [a-fA-F0-9] [a-fA-F0-9]);
PATH_CHAR
: URLHEX
| LETTER
| DIGIT
| '-'
| '_'
| '.'
| '!'
| '~'
| '*'
| '\\'
| '\''
| '('
| ')'
| ':'
| '@'
| '&'
| '='
| '+'
| '$'
| ',';
IDENTIFIER: (LETTER | '_') ( LETTER | DIGIT | '_')*;
WS: ( '\t' | ' ' | '\r' | '\n' )+ -> skip;
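The problem is that foo and bar are tokenized as IDENTIFIER, because that is the longest match at those positions. The parser rule pathSegment only accepts PATH_CHAR tokens, so it fails on them. I'd like pathSegment to match correctly in this context. How can I resolve this ambiguity? This is the token stream and the errors I currently get: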
[@0,0:4='start',<'start'>,1:0]
[@1,6:13='rule_set',<IDENTIFIER>,1:6]
[@2,15:15='{',<'{'>,1:15]
[@3,21:21='/',<'/'>,2:4]
[@4,22:24='foo',<IDENTIFIER>,2:5]
[@5,25:25='/',<'/'>,2:8]
[@6,26:28='bar',<IDENTIFIER>,2:9]
[@7,30:30='{',<'{'>,2:13]
[@8,40:51='//some_rules',<'//some_rules'>,3:8]
[@9,57:57='}',<'}'>,4:4]
[@10,59:59='}',<'}'>,5:0]
[@11,62:61='<EOF>',<EOF>,7:0]
line 2:5 mismatched input 'foo' expecting PATH_CHAR
line 2:9 mismatched input 'bar' expecting PATH_CHAR
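
To make the cause concrete, here is a minimal Python sketch of how I understand the lexer to choose tokens: at each position the longest match wins, and on a tie the rule defined first wins. It is only a simulation, not ANTLR itself. The rule table is my hand translation of the grammar above into regular expressions, and the function name tokenize is made up for this illustration.

```python
import re

# Hand-translated lexer rules (an assumption, not generated by ANTLR).
# Implicit literals from the parser rules come first, then the explicit
# lexer rules in the order they appear in the grammar.
RULES = [
    ("'start'", r"start"),
    ("'/'", r"/"),
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("RULES", r"//some_rules"),
    ("PATH_CHAR", r"%[0-9A-Fa-f]{2}|[A-Za-z0-9\-_.!~*\\'():@&=+$,]"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WS", r"[ \t\r\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Return (rule, lexeme) pairs using longest-match, first-rule-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in COMPILED:
            m = regex.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for name, lexeme in tokenize("/foo/bar"):
        print(name, repr(lexeme))
```

In this sketch foo and bar come out as IDENTIFIER, because PATH_CHAR can only ever match one character (or one %XX escape) while IDENTIFIER consumes the whole word. A single-letter segment such as /a comes out as PATH_CHAR instead, since both rules match one character and PATH_CHAR is defined first. The lexer makes this choice without knowing which parser rule is active, which seems to be why pathSegment never sees the tokens it expects.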