How can I define a string with an escape boundary (like a multipart MIME type) in ANTLR4?

Asked: 2019-07-06 12:48:28

Tags: antlr4

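In my grammar, I want to allow two syntaxes for strings:

  1. The classic way, "my \"string\"", which is no problem.
  2. A new way with an arbitrary escape boundary: |"my "string"|" or |x"my |"string"|x". The aim is to keep the string content free of any escaping, so that, for example, a JS fragment embedded in an X(HT)ML file never needs an escaped form of something like a && b.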

Conceptually, I would like to express something like this:

'|' {$Boundary} '"' {AnyCharSequenceExcept('|' $Boundary '"')} '|' {$Boundary} '"'

I know I cannot do this in standard ANTLR4. Can it be done with actions?

1 answer:

Answer 0: (score: 0)

Here is one way to do it:

lexer grammar DemoLexer;

@members {

def ahead(self, steps):
    """
    Returns the next `steps` characters ahead in the character stream, or None if
    there aren't `steps` characters ahead anymore
    """
    text = ""
    for n in range(1, steps + 1):
        # LA(n) peeks at the n-th upcoming code point without consuming it
        code_point = self._input.LA(n)
        if code_point == Token.EOF:
            return None
        text += chr(code_point)
    return text

def consume_until(self, open_tag):
    """
    If we get here, it means the lexer matched an opening tag, and we now consume as
    many characters as needed until we match the corresponding closing tag
    """
    while True:
        ahead = self.ahead(len(open_tag))
        if ahead is None:
            raise Exception("missing '{}' close tag".format(open_tag))
        if ahead == open_tag:
            break
        self._input.consume()

    # Be sure to consume the closing tag, which is identical to `open_tag`
    for _ in range(len(open_tag)):
        self._input.consume()

}

STRING
 : '|' ~'"'* '"' {self.consume_until(self.text)}
 ;

SPACES
 : [ \t\r\n] -> skip
 ;

OTHER
 : .
 ;
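
The lookahead logic in the action does not depend on ANTLR, so it can be tried on a plain string first. The following is only a minimal sketch and not part of the grammar: scan_bounded_string is a hypothetical helper name, and str.find stands in for the character-by-character loop of consume_until.

```python
def scan_bounded_string(source, start=0):
    """
    Return the index just past the bounded string that begins at source[start].

    Hypothetical helper mirroring the STRING rule: the opening tag is '|',
    any characters except '"', then '"'; the string ends at the next
    occurrence of that same tag.
    """
    if not source.startswith("|", start):
        raise ValueError("expected '|' at index {}".format(start))

    # The opening tag runs up to and including the first double quote
    quote = source.find('"', start + 1)
    if quote == -1:
        raise ValueError("opening tag is missing its '\"'")
    open_tag = source[start:quote + 1]

    # The closing tag is the same text as the opening tag
    close = source.find(open_tag, quote + 1)
    if close == -1:
        raise ValueError("missing '{}' close tag".format(open_tag))
    return close + len(open_tag)


source = 'foo |x"my |"string"|x" bar'
start = source.index("|")
end = scan_bounded_string(source, start)
print(source[start:end])  # |x"my |"string"|x"
```

Because the closing tag must repeat the whole opening tag, the inner |" in the example does not end the string; only |x" does.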

If you generate a lexer from the DemoLexer grammar and run the following (Python) script:

from antlr4 import *
from DemoLexer import DemoLexer


source = """
foo |x"my |"string"|x" bar
"""

lexer = DemoLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()

for token in stream.tokens[:-1]:
    print("{0:<25} '{1}'".format(DemoLexer.symbolicNames[token.type], token.text))

the following will be printed to your console:

OTHER                     'f'
OTHER                     'o'
OTHER                     'o'
STRING                    '|x"my |"string"|x"'
OTHER                     'b'
OTHER                     'a'
OTHER                     'r'
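
Note that the text of the STRING token still includes both tags. If only the raw content is needed, the tags can be stripped afterwards. The following is a minimal sketch, assuming the token text has the shape produced by the STRING rule; split_bounded_string is a hypothetical helper, not part of the ANTLR runtime.

```python
def split_bounded_string(token_text):
    """
    Split the text of a STRING token into (boundary, content).

    Hypothetical helper: assumes `token_text` looks like
    '|' boundary '"' content '|' boundary '"'.
    """
    quote = token_text.find('"')
    if not token_text.startswith("|") or quote == -1:
        raise ValueError("not a bounded string: {!r}".format(token_text))

    open_tag = token_text[:quote + 1]
    # The text must hold the tag twice: once to open, once to close
    if len(token_text) < 2 * len(open_tag) or not token_text.endswith(open_tag):
        raise ValueError("missing '{}' close tag".format(open_tag))

    boundary = open_tag[1:-1]  # drop the leading '|' and the trailing '"'
    content = token_text[len(open_tag):len(token_text) - len(open_tag)]
    return boundary, content


boundary, content = split_bounded_string('|x"my |"string"|x"')
print(boundary)  # x
print(content)   # my |"string"
```

The content comes back exactly as written in the source, with no unescaping step, which is the point of the boundary syntax.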