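In my grammar, I want to allow two syntaxes for strings. The first is the usual escaped form, which is no problem:
"my \"string\""
The second is delimited by a boundary, for example:
|"my "string"|"
|x"my |"string"|x"
The purpose is to keep the string content free of any escaping, so that, for example, a JS snippet embedded in an X(HT)ML file never has to escape something like a && b. In spirit, I would like to express the following:
'|' {$Boundary} '"' {AnyCharSequenceExcept('|' $Boundary '"')} '|' {$Boundary} '"'
I know I can't do this in standard ANTLR 4. Can it be done with actions?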
Answer 0 (score: 0)
Here is one way to do it:
lexer grammar DemoLexer;

@members {
def ahead(self, steps):
    """
    Returns the next `steps` characters ahead in the character stream, or None if
    there aren't `steps` characters ahead anymore
    """
    text = ""
    for n in range(1, steps + 1):
        next_char = self._input.LA(n)
        if next_char == Token.EOF:
            return None
        text += chr(next_char)
    return text

def consume_until(self, open_tag):
    """
    If we get here, the lexer matched an opening tag, and we now consume as many
    characters as needed until we match the corresponding closing tag
    """
    while True:
        ahead = self.ahead(len(open_tag))
        if ahead is None:
            raise Exception("missing '{}' close tag".format(open_tag))
        if ahead == open_tag:
            break
        self._input.consume()

    # Be sure to consume the closing tag, which has the same character count as `open_tag`
    for n in range(0, len(open_tag)):
        self._input.consume()
}

STRING
 : '|' ~'"'* '"' {self.consume_until(self.text)}
 ;

SPACES
 : [ \t\r\n] -> skip
 ;

OTHER
 : .
 ;
If you generate a lexer from the grammar above and run the following (Python) script:
from antlr4 import *
from DemoLexer import DemoLexer

source = """
foo |x"my |"string"|x" bar
"""

lexer = DemoLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()

# Skip the last token, which is EOF
for token in stream.tokens[:-1]:
    print("{0:<25} '{1}'".format(DemoLexer.symbolicNames[token.type], token.text))
the following will be printed to your console:
OTHER                     'f'
OTHER                     'o'
OTHER                     'o'
STRING                    '|x"my |"string"|x"'
OTHER                     'b'
OTHER                     'a'
OTHER                     'r'
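
To see the lookahead logic in isolation, here is a minimal sketch of the same idea in plain Python, without ANTLR. It is not part of the grammar above: `read_string` is a hypothetical helper name, and the functions work on a string plus an index rather than on ANTLR's character stream. The opening tag is everything from `|` up to and including the first `"`, and scanning continues until that exact text appears again.

```python
def ahead(text, pos, steps):
    """Return the next `steps` characters starting at `pos`, or None if fewer remain."""
    chunk = text[pos:pos + steps]
    return chunk if len(chunk) == steps else None


def consume_until(text, pos, open_tag):
    """Return the index just past the first occurrence of `open_tag` at or after `pos`."""
    while True:
        upcoming = ahead(text, pos, len(open_tag))
        if upcoming is None:
            raise ValueError("missing '{}' close tag".format(open_tag))
        if upcoming == open_tag:
            # The closing tag has the same length as the opening tag
            return pos + len(open_tag)
        pos += 1


def read_string(text, start):
    """Read a boundary-delimited literal that begins with '|' at index `start`."""
    if text[start] != "|":
        raise ValueError("expected '|' at index {}".format(start))
    quote = text.index('"', start)      # end of the opening tag
    open_tag = text[start:quote + 1]    # e.g. '|x"' (or '|"' for an empty boundary)
    end = consume_until(text, quote + 1, open_tag)
    return text[start:end]


source = 'foo |x"my |"string"|x" bar'
literal = read_string(source, source.index("|"))
print(literal)  # prints |x"my |"string"|x"
```

Because the closing tag is defined as a repeat of the opening tag, an inner `|"` does not end the literal when the boundary is `x`, which matches the STRING token in the output above.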