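Hi everyone,
This may be a follow-up to this question: Antlr rule priorities
I'm trying to write an ANTLR grammar for the reStructuredText markup language.
The main problem I'm facing is: "How do I match any sequence of characters (regular text) without masking the other grammar rules?"
Let's take an example of a paragraph with inline markup: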
In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element
*before* the insert point. The variable ``after_ptr`` points to the element *after* the
insert. In other words, we are going to put our new element **in between** ``before_ptr``
and ``after_ptr``.
I thought it would be easy to write rules for inline markup text, so I wrote a simple grammar:
grammar Rst;
options {
output=AST;
language=Java;
backtrack=true;
//memoize=true;
}
@members {
boolean inInlineMarkup = false;
}
// PARSER
text
: inline_markup (WS? inline_markup)* WS? EOF
;
inline_markup
@after {
inInlineMarkup = false;
}
: {!inInlineMarkup}? (emphasis|strong|litteral|link)
;
emphasis
@init {
inInlineMarkup = true;
}
: '*' (~'*')+ '*' {System.out.println("emphasis: " + $text);}
;
strong
@init {
inInlineMarkup = true;
}
: '**' (~'*')+ '**' {System.out.println("bold: " + $text);}
;
litteral
@init {
inInlineMarkup = true;
}
: '``' (~'`')+ '``' {System.out.println("litteral: " + $text);}
;
link
@init {
inInlineMarkup = true;
}
: inline_internal_target
| footnote_reference
| hyperlink_reference
;
inline_internal_target
: '_`' (~'`')+ '`' {System.out.println("inline_internal_target: " + $text);}
;
footnote_reference
: '[' (~']')+ ']_' {System.out.println("footnote_reference: " + $text);}
;
hyperlink_reference
: ~(' '|'\t'|'\u000C'|'_')+ '_' {System.out.println("hyperlink_reference: " + $text);}
| '`' (~'`')+ '`_' {System.out.println("hyperlink_reference (long): " + $text);}
;
// LEXER
WS
: (' '|'\t'|'\u000C')+
;
NEWLINE
: '\r'? '\n'
;
This simple grammar doesn't work, and I haven't even tried to match regular text yet...
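My questions:
Can someone point out my errors and maybe give me a hint on how to match regular text?
Is there a way to set priorities on the grammar rules? Maybe this could be a lead.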
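Thanks in advance for your help :-)
Robin
Thanks a lot for your help! I had a hard time figuring out my mistakes... I'm not writing this grammar (only) to learn ANTLR; I'm trying to write an IDE plugin for Eclipse, and for that I need a grammar ;)
I managed to get further with the grammar and wrote a text rule: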
grammar Rst;
options {
output=AST;
language=Java;
}
@members {
boolean inInlineMarkup = false;
}
//////////////////
// PARSER RULES //
//////////////////
file
: line* EOF
;
line
: text* NEWLINE
;
text
: inline_markup
| normal_text
;
inline_markup
@after {
inInlineMarkup = false;
}
: {!inInlineMarkup}? {inInlineMarkup = true;}
(
| STRONG
| EMPHASIS
| LITTERAL
| INTERPRETED_TEXT
| SUBSTITUTION_REFERENCE
| link
)
;
link
: INLINE_INTERNAL_TARGET
| FOOTNOTE_REFERENCE
| HYPERLINK_REFERENCE
;
normal_text
: {!inInlineMarkup}?
~(EMPHASIS
|SUBSTITUTION_REFERENCE
|STRONG
|LITTERAL
|INTERPRETED_TEXT
|INLINE_INTERNAL_TARGET
|FOOTNOTE_REFERENCE
|HYPERLINK_REFERENCE
|NEWLINE
)
;
//////////////////
// LEXER TOKENS //
//////////////////
EMPHASIS
: STAR ANY_BUT_STAR+ STAR {System.out.println("EMPHASIS: " + $text);}
;
SUBSTITUTION_REFERENCE
: PIPE ANY_BUT_PIPE+ PIPE {System.out.println("SUBST_REF: " + $text);}
;
STRONG
: STAR STAR ANY_BUT_STAR+ STAR STAR {System.out.println("STRONG: " + $text);}
;
LITTERAL
: BACKTICK BACKTICK ANY_BUT_BACKTICK+ BACKTICK BACKTICK {System.out.println("LITTERAL: " + $text);}
;
INTERPRETED_TEXT
: BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("LITTERAL: " + $text);}
;
INLINE_INTERNAL_TARGET
: UNDERSCORE BACKTICK ANY_BUT_BACKTICK+ BACKTICK {System.out.println("INLINE_INTERNAL_TARGET: " + $text);}
;
FOOTNOTE_REFERENCE
: L_BRACKET ANY_BUT_BRACKET+ R_BRACKET UNDERSCORE {System.out.println("FOOTNOTE_REFERENCE: " + $text);}
;
HYPERLINK_REFERENCE
: BACKTICK ANY_BUT_BACKTICK+ BACKTICK UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (long): " + $text);}
| ANY_BUT_ENDLINK+ UNDERSCORE {System.out.println("HYPERLINK_REFERENCE (short): " + $text);}
;
WS
: (' '|'\t')+ {$channel=HIDDEN;}
;
NEWLINE
: '\r'? '\n' {$channel=HIDDEN;}
;
///////////////
// FRAGMENTS //
///////////////
fragment ANY_BUT_PIPE
: ESC PIPE
| ~(PIPE|'\n'|'\r')
;
fragment ANY_BUT_BRACKET
: ESC R_BRACKET
| ~(R_BRACKET|'\n'|'\r')
;
fragment ANY_BUT_STAR
: ESC STAR
| ~(STAR|'\n'|'\r')
;
fragment ANY_BUT_BACKTICK
: ESC BACKTICK
| ~(BACKTICK|'\n'|'\r')
;
fragment ANY_BUT_ENDLINK
: ~(UNDERSCORE|' '|'\t'|'\n'|'\r')
;
fragment ESC
: '\\'
;
fragment STAR
: '*'
;
fragment BACKTICK
: '`'
;
fragment PIPE
: '|'
;
fragment L_BRACKET
: '['
;
fragment R_BRACKET
: ']'
;
fragment UNDERSCORE
: '_'
;
The grammar works fine for inline_markup, but normal_text is never matched.
Here is my test class:
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.tree.Tree;
public class Test {
public static void main(String[] args) throws RecognitionException, IOException {
InputStream is = Test.class.getResourceAsStream("test.rst");
Reader r = new InputStreamReader(is);
StringBuilder source = new StringBuilder();
char[] buffer = new char[1024];
int readLenght = 0;
while ((readLenght = r.read(buffer)) > 0) {
if (readLenght < buffer.length) {
source.append(buffer, 0, readLenght);
} else {
source.append(buffer);
}
}
r.close();
System.out.println(source.toString());
ANTLRStringStream in = new ANTLRStringStream(source.toString());
RstLexer lexer = new RstLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
RstParser parser = new RstParser(tokens);
RstParser.file_return out = parser.file();
System.out.println(((Tree)out.getTree()).toStringTree());
}
}
The input file I used:
In `Figure 17-6`_, we have positioned ``before_ptr`` so that it points to the element
*before* the insert point. The variable ``after_ptr`` points to the |element| *after* the
insert. In other words, `we are going`_ to put_ our new element **in between** ``before_ptr``
and ``after_ptr``.
And this is the output I get:
HYPERLINK_REFERENCE (short): 7-6`_
line 1:2 mismatched character ' ' expecting '_'
line 1:10 mismatched character ' ' expecting '_'
line 1:18 mismatched character ' ' expecting '_'
line 1:21 mismatched character ' ' expecting '_'
line 1:26 mismatched character ' ' expecting '_'
line 1:37 mismatched character ' ' expecting '_'
LITTERAL: `before_ptr`
line 1:86 no viable alternative at character '\r'
line 1:55 mismatched character ' ' expecting '_'
line 1:60 mismatched character ' ' expecting '_'
line 1:63 mismatched character ' ' expecting '_'
line 1:70 mismatched character ' ' expecting '_'
line 1:73 mismatched character ' ' expecting '_'
line 1:77 mismatched character ' ' expecting '_'
line 1:85 mismatched character ' ' expecting '_'
EMPHASIS: *before*
line 2:12 mismatched character ' ' expecting '_'
line 2:19 mismatched character ' ' expecting '_'
line 2:26 mismatched character ' ' expecting '_'
LITTERAL: `after_ptr`
line 2:30 mismatched character ' ' expecting '_'
line 2:39 mismatched character ' ' expecting '_'
line 2:90 no viable alternative at character '\r'
line 2:60 mismatched character ' ' expecting '_'
line 2:63 mismatched character ' ' expecting '_'
line 2:67 mismatched character ' ' expecting '_'
line 2:77 mismatched character ' ' expecting '_'
line 2:85 mismatched character ' ' expecting '_'
line 2:89 mismatched character ' ' expecting '_'
line 3:7 mismatched character ' ' expecting '_'
line 3:10 mismatched character ' ' expecting '_'
line 3:16 mismatched character ' ' expecting '_'
line 3:23 mismatched character ' ' expecting '_'
line 3:27 mismatched character ' ' expecting '_'
line 3:31 mismatched character ' ' expecting '_'
line 3:42 mismatched character ' ' expecting '_'
line 3:51 mismatched character ' ' expecting '_'
line 3:55 mismatched character ' ' expecting '_'
line 3:63 mismatched character ' ' expecting '_'
line 3:94 mismatched character '\r' expecting '*'
line 4:3 mismatched character ' ' expecting '_'
line 4:18 no viable alternative at character '\r'
line 4:18 mismatched character '\r' expecting '_'
HYPERLINK_REFERENCE (short): oing`_
HYPERLINK_REFERENCE (short): ut_
EMPHASIS: *in between*
LITTERAL: `after_ptr`
BR.recoverFromMismatchedToken
line 0:-1 mismatched input '<EOF>' expecting NEWLINE
null
Can you point out my errors? (When I add the filter=true; option to the grammar, the parser handles the inline markup without errors.)
Robin
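Answer 0: (score: 6)
Here is a quick demo of how you could parse this reStructuredText. Note that it handles only a small subset of all available markup syntax, and adding more to it will affect the existing parser/lexer rules, so there is a lot more work to be done!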
grammar RST;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
ROOT;
PARAGRAPH;
INDENTATION;
LINE;
WORD;
BOLD;
ITALIC;
INTERPRETED_TEXT;
INLINE_LITERAL;
REFERENCE;
}
parse
: paragraph+ EOF -> ^(ROOT paragraph+)
;
paragraph
: line+ -> ^(PARAGRAPH line+)
| Space* LineBreak -> /* omit line-breaks between paragraphs from AST */
;
line
: indentation text+ LineBreak -> ^(LINE text+)
;
indentation
: Space* -> ^(INDENTATION Space*)
;
text
: styledText
| interpretedText
| inlineLiteral
| reference
| Space
| Star
| EscapeSequence
| Any
;
styledText
: bold
| italic
;
bold
: Star Star boldAtom+ Star Star -> ^(BOLD boldAtom+)
;
italic
: Star italicAtom+ Star -> ^(ITALIC italicAtom+)
;
boldAtom
: ~(Star | LineBreak)
| italic
;
italicAtom
: ~(Star | LineBreak)
| bold
;
interpretedText
: BackTick interpretedTextAtoms BackTick -> ^(INTERPRETED_TEXT interpretedTextAtoms)
;
interpretedTextAtoms
: ~BackTick+
;
inlineLiteral
: BackTick BackTick inlineLiteralAtoms BackTick BackTick -> ^(INLINE_LITERAL inlineLiteralAtoms)
;
inlineLiteralAtoms
: inlineLiteralAtom+
;
inlineLiteralAtom
: ~BackTick
| BackTick ~BackTick
;
reference
: Any+ UnderScore -> ^(REFERENCE Any+)
;
UnderScore
: '_'
;
BackTick
: '`'
;
Star
: '*'
;
Space
: ' '
| '\t'
;
EscapeSequence
: '\\' ('\\' | '*')
;
LineBreak
: '\r'? '\n'
| '\r'
;
Any
: .
;
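When you generate a parser and lexer from the grammar above and let it parse the following input file:
***x*** **yyy** *zz* *
a b c

P2 ``*a*`b`` `q`
Python_
(note the trailing line break!)
the parser will produce an AST. A diagram of that AST can be created by running this class: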
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"***x*** **yyy** *zz* *\n" +
"a b c\n" +
"\n" +
"P2 ``*a*`b`` `q`\n" +
"Python_\n";
RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
RSTParser parser = new RSTParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
Or, if your source comes from a file, do:
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst"));
or
RSTLexer lexer = new RSTLexer(new ANTLRFileStream("test.rst", "???"));
where "???" is the encoding of your file.
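If you are unsure what the encoding argument changes, the following is a minimal pure-Java sketch of the decoding step only. The class name SourceLoader and the choice of UTF-8 are assumptions made for illustration; they are not part of ANTLR. The decoded string can then be handed to ANTLRStringStream, as in the Main class above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not part of ANTLR): decodes raw bytes with an explicit encoding.
class SourceLoader {
    // Turns the raw bytes of a source file into a String using the given encoding.
    static String decode(byte[] bytes, Charset encoding) {
        return new String(bytes, encoding);
    }

    public static void main(String[] args) throws IOException {
        // "test.rst" is the file name used in the question; UTF-8 is an assumption.
        byte[] raw = Files.readAllBytes(Paths.get("test.rst"));
        String source = decode(raw, StandardCharsets.UTF_8);
        System.out.println(source);
        // The string could then be fed to the lexer, for example:
        // RSTLexer lexer = new RSTLexer(new ANTLRStringStream(source));
    }
}
```

Decoding with the wrong encoding does not fail; it silently produces different characters, which the lexer will then see as unexpected input.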
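The Main class above prints the AST as a DOT file to the console. You can use a DOT viewer to display the AST. In this case, I posted the image created by kgraphviewer, but there are many more viewers around. A nice online one is this one, which appears to use kgraphviewer "under the hood". Good luck!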
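Answer 1: (score: 4)
Robin wrote:
I thought it would be easy to write rules for inline markup text
I must admit I'm not familiar with this markup language, but it seems to resemble BB-Code or Wiki markup, which are not easily translated into an (ANTLR) grammar! These languages cannot be tokenized easily, because what a token means depends on where it occurs. White space sometimes has a special meaning (as with definition lists). So no, it's not easy at all, IMO. If this is just an exercise to get acquainted with ANTLR (or parser generators in general), I highly recommend choosing something else to parse.
Robin wrote:
Can someone point out my errors and maybe give me a hint on how to match regular text?
You must first realize that ANTLR creates a lexer (tokenizer) and a parser. Lexer rules start with an upper-case letter and parser rules start with a lower-case letter. A parser can only operate on tokens (the objects produced by lexer rules). To keep things orderly, you should not use token literals inside parser rules (see rule q in the grammar below). Also, the ~ (negation) meta-character has a different meaning depending on where it is used (in a parser rule or in a lexer rule).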
Take the following grammar:
p : T;
q : ~'z';
T : ~'x';
U : 'y';
ANTLR will first "move" the 'z' literal into a lexer rule, like this:
p : T;
q : ~RANDOM_NAME;
T : ~'x';
U : 'y';
RANDOM_NAME : 'z';
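(the name RANDOM_NAME is not what ANTLR actually uses, but that doesn't matter). Now, the parser rule q does not mean "any character other than 'z'"! A negation inside a parser rule negates a token (or lexer rule). So ~RANDOM_NAME will match either a token from lexer rule T or a token from lexer rule U.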
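Inside lexer rules, ~ negates (single!) characters. So the lexer rule T will match any character in the range \u0000 .. \uFFFF except 'x'. Note that ~'ab' is invalid inside a lexer rule: you can only negate sets of single characters.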
So all these ~'???' inside your parser rules are wrong (wrong as in: they don't behave the way you expect them to).
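To make the difference concrete, here is a small pure-Java sketch that only models the two meanings of ~ (it does not use ANTLR). The class NegationSketch, the enum TokenType and both method names are hypothetical; the three token types mirror the lexer rules T, U and RANDOM_NAME from the grammar above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the two meanings of '~'; not ANTLR code.
class NegationSketch {
    // The token types the lexer from the example grammar can produce.
    enum TokenType { T, U, RANDOM_NAME }

    // Parser-level negation: ~RANDOM_NAME means "any token type except RANDOM_NAME".
    static Set<TokenType> parserNot(TokenType excluded) {
        return EnumSet.complementOf(EnumSet.of(excluded));
    }

    // Lexer-level negation: ~'x' means "any single character except 'x'".
    static boolean lexerNot(char excluded, char c) {
        return c != excluded;
    }

    public static void main(String[] args) {
        // The parser sees token types, so the complement is a set of token types.
        System.out.println(parserNot(TokenType.RANDOM_NAME));
        // The lexer sees characters, so the complement is a test on one character.
        System.out.println(lexerNot('x', 'z')); // true
        System.out.println(lexerNot('x', 'x')); // false
    }
}
```

The point of the sketch is the types: in a parser rule the complement ranges over token types, in a lexer rule it ranges over characters.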
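Robin wrote:
Is there a way to set priorities on the grammar rules? Maybe this could be a lead.
Yes, in both lexer and parser rules the order is top to bottom (the top having the highest priority). Let's say parse is the entry point of your grammar: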
parse
: p
| q
;
then p is tried first and, only if that fails, q is tried.
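As a rough illustration of "first alternative wins", the sketch below tries a list of alternatives in declaration order and returns the first one that succeeds. It is a simplified model, not how the generated ANTLR parser is implemented; the class OrderedChoiceSketch and the two predicates standing in for p and q are assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of ordered alternatives; not ANTLR code.
class OrderedChoiceSketch {
    // Tries the alternatives in declaration order; the first one that matches wins.
    static String parse(Map<String, Predicate<String>> alternatives, String input) {
        for (Map.Entry<String, Predicate<String>> alt : alternatives.entrySet()) {
            if (alt.getValue().test(input)) {
                return alt.getKey();
            }
        }
        return null; // no viable alternative
    }

    // Two deliberately overlapping stand-ins for the rules p and q.
    static Map<String, Predicate<String>> grammar() {
        Map<String, Predicate<String>> alts = new LinkedHashMap<>(); // keeps insertion order
        alts.put("p", s -> s.startsWith("a")); // made-up condition for p
        alts.put("q", s -> !s.isEmpty());      // made-up condition for q, also true for "abc"
        return alts;
    }

    public static void main(String[] args) {
        System.out.println(parse(grammar(), "abc")); // p (q would match too, but p comes first)
        System.out.println(parse(grammar(), "xyz")); // q
    }
}
```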
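As for lexer rules: keyword rules, for example, should be placed before a rule that could also match those keywords, so that they are matched first: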
// first keywords:
WHILE : 'while';
IF : 'if';
ELSE : 'else';
// and only then, the identifier rule:
ID : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
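The effect of that ordering can be simulated in plain Java. The sketch below assumes a simple disambiguation policy (the longest match wins, and on a tie the rule declared first wins); it is only a model of the behaviour described here, not ANTLR's actual prediction algorithm. The class RulePrioritySketch and the regular expressions are written for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of lexer rule priority; not ANTLR code.
class RulePrioritySketch {
    // Rules in declaration order: keywords first, then the identifier rule.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WHILE", Pattern.compile("while"));
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ELSE", Pattern.compile("else"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"));
    }

    // Returns the name of the rule that wins at the start of the input,
    // or null if no rule matches there.
    static String classify(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Only a strictly longer match replaces the current winner,
            // so on a tie the rule declared earlier is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(classify("while"));  // WHILE (ties with ID, declared first)
        System.out.println(classify("whilex")); // ID (longer match)
    }
}
```

If ID were declared before the keyword rules in this model, "while" would be classified as ID, which is the situation the ordering above avoids.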