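I am trying to build Bart Kiers' ANTLR PCRE grammar (see: http://big-o.nl/apps/pcreparser/pcre/PCREParser.html) for the JS target. The only way I could get it to build was with global backtracking and memoization, and the code it generates is not valid syntax. Here is the grammar: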
grammar PCRE;
options {
language=JavaScript;
backtrack=true;
memoize=true;
}
parse
: regexAtom* EOF
;
... and the rest of the grammar as seen: http://big-o.nl/apps/pcreparser/pcre/PCREParser.html
The code generated for the lexer looks like this:
//$ANTLR 3.4 PCRE.g 2011-11-19 23:22:35
var PCRELexer = function(input, state) {
// alternate constructor @todo
// public PCRELexer(CharStream input)
// public PCRELexer(CharStream input, RecognizerSharedState state) {
if (!state) {
state = new org.antlr.runtime.RecognizerSharedState();
}
(function(){
}).call(this);
PCRELexer.superclass.constructor.call(this, input, state);
};
org.antlr.lang.augmentObject(PCRELexer, {
: ,
: ,
: ,
: ,
: ,
... and more of this empty object.
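When I try to use this generated code, I get a JS error on the augmentObject line above. Can someone point me in the right direction for building this properly for the JS target? I will eventually build a walker that produces output similar to the output shown on that same page.
Answer 0 (score: 6)
Update: a grammar that generates the JS lexer and parser files can be found here: https://github.com/bkiers/PCREParser/tree/js
I haven't updated that page in a while (I will when I find the time). Here is a modified grammar that does not use global backtracking and that works with JavaScript as far as I can tell (tested with ANTLR v3.3):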
grammar PCRE;
options {
output=AST;
language=JavaScript;
}
tokens {
REGEX;
ATOM;
DOT;
OR;
CHAR_CLASS;
NEG_CHAR_CLASS;
RANGE;
QUOTATION;
INT;
QUANTIFIER;
GREEDY;
RELUCTANT;
POSSESSIVE;
BACK_REFERENCE;
CAPTURE_GROUP;
FLAG_GROUP;
ATOMIC_GROUP;
NON_CAPTURE_GROUP;
POSITIVE_LOOK_AHEAD;
NEGATIVE_LOOK_AHEAD;
POSITIVE_LOOK_BEHIND;
NEGATIVE_LOOK_BEHIND;
FLAGS;
ENABLE;
DISABLE;
ATOMS;
}
// parser rules
parse
: regexAtoms EOF -> ^(REGEX regexAtoms)
;
regexAtoms
: atoms (Or^ atoms)*
;
atoms
: regexAtom* -> ^(ATOMS regexAtom*)
;
regexAtom
: unit quantifier? -> ^(ATOM unit quantifier?)
;
unit
: charClass
| singleChar
| boundaryMatch
| quotation
| backReference
| group
| ShorthandCharacterClass
| PosixCharacterClass
| Dot
;
quantifier
: (greedy -> ^(GREEDY greedy))
('+' -> ^(POSSESSIVE greedy)
|'?' -> ^(RELUCTANT greedy)
)?
;
greedy
: '+' -> INT["1"] INT["2147483647"]
| '*' -> INT["0"] INT["2147483647"]
| '?' -> INT["0"] INT["1"]
| '{' (a=integer -> INT[$a.text] INT[$a.text])
(
(',' -> INT[$a.text] INT["2147483647"])
(b=integer -> INT[$a.text] INT[$b.text])?
)?
'}'
;
charClass
: '[' (('^')=> '^' charClassAtom+ ']' -> ^(NEG_CHAR_CLASS charClassAtom+)
| charClassAtom+ ']' -> ^(CHAR_CLASS charClassAtom+)
)
;
charClassAtom
: (charClassSingleChar '-' charClassSingleChar)=>
charClassSingleChar '-' charClassSingleChar -> ^(RANGE charClassSingleChar charClassSingleChar)
| quotation
| ShorthandCharacterClass
| BoundaryMatch
| PosixCharacterClass
| charClassSingleChar
;
charClassSingleChar
: charClassEscape
| EscapeSequence
| OctalNumber
| SmallHexNumber
| UnicodeChar
| Or
| Caret
| Hyphen
| Colon
| Dollar
| SquareBracketStart
| RoundBracketStart
| RoundBracketEnd
| CurlyBracketStart
| CurlyBracketEnd
| Equals
| LessThan
| GreaterThan
| ExclamationMark
| Comma
| Plus
| Star
| QuestionMark
| Dot
| Digit
| OtherChar
;
charClassEscape
: '\\' ('\\' | '^' | ']' | '-')
;
singleChar
: regexEscape
| EscapeSequence
| OctalNumber
| SmallHexNumber
| UnicodeChar
| Hyphen
| Colon
| SquareBracketEnd
| CurlyBracketEnd
| Equals
| LessThan
| GreaterThan
| ExclamationMark
| Comma
| Digit
| OtherChar
;
regexEscape
: '\\' ('\\' | '|' | '^' | '$' | '[' | '(' | ')' | '{' | '}' | '+' | '*' | '?' | '.')
;
boundaryMatch
: Caret
| Dollar
| BoundaryMatch
;
backReference
: '\\' integer -> ^(BACK_REFERENCE integer)
;
group
: '('
( '?' ( (flags -> ^(FLAG_GROUP flags)
)
(':' regexAtoms -> ^(NON_CAPTURE_GROUP flags regexAtoms)
)?
| '>' regexAtoms -> ^(ATOMIC_GROUP regexAtoms)
| '!' regexAtoms -> ^(NEGATIVE_LOOK_AHEAD regexAtoms)
| '=' regexAtoms -> ^(POSITIVE_LOOK_AHEAD regexAtoms)
| '<' ( '!' regexAtoms -> ^(NEGATIVE_LOOK_BEHIND regexAtoms)
| '=' regexAtoms -> ^(POSITIVE_LOOK_BEHIND regexAtoms)
)
)
| regexAtoms -> ^(CAPTURE_GROUP regexAtoms)
)
')'
;
flags
: (a=singleFlags -> ^(FLAGS ^(ENABLE $a)))
('-' b=singleFlags -> ^(FLAGS ^(ENABLE $a) ^(DISABLE $b))
)?
;
singleFlags
: OtherChar*
;
quotation
: QuotationStart innerQuotation QuotationEnd -> ^(QUOTATION innerQuotation)
;
innerQuotation
: (~QuotationEnd)*
;
integer
: (options{greedy=true;}: Digit)+
;
// lexer rules
QuotationStart
: '\\Q'
;
QuotationEnd
: '\\E'
;
PosixCharacterClass
: '\\p{' ('Lower' | 'Upper' | 'ASCII' | 'Alpha' | 'Digit' | 'Alnum' | 'Punct' | 'Graph' | 'Print' | 'Blank' | 'Cntrl' | 'XDigit' | 'Space') '}'
;
ShorthandCharacterClass
: Escape ('d' | 'D' | 's' | 'S' | 'w' | 'W')
;
BoundaryMatch
: Escape ('b' | 'B' | 'A' | 'Z' | 'z' | 'G')
;
OctalNumber
: Escape '0' ( OctDigit? OctDigit
| '0'..'3' OctDigit OctDigit
)
;
SmallHexNumber
: Escape 'x' HexDigit HexDigit
;
UnicodeChar
: Escape 'u' HexDigit HexDigit HexDigit HexDigit
;
EscapeSequence
: Escape ('t' | 'n' | 'r' | 'f' | 'a' | 'e' | ~('a'..'z' | 'A'..'Z' | '0'..'9'))
;
Escape : '\\';
Or : '|';
Hyphen : '-';
Caret : '^';
Colon : ':';
Dollar : '$';
SquareBracketStart : '[';
SquareBracketEnd : ']';
RoundBracketStart : '(';
RoundBracketEnd : ')';
CurlyBracketStart : '{';
CurlyBracketEnd : '}';
Equals : '=';
LessThan : '<';
GreaterThan : '>';
ExclamationMark : '!';
Comma : ',';
Plus : '+';
Star : '*';
QuestionMark : '?';
Dot : '.';
Digit : '0'..'9';
OtherChar : . ;
// fragments
fragment OctDigit : '0'..'7';
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F');
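It contains next to no target-specific code. The only things I did were use a couple of string literals to rewrite the AST (see the quantifier rule) and a couple of .text calls, but nearly all ANTLR targets accept double-quoted string literals and .text, so you should be fine with Java, Python, C and JavaScript. For C#, I guess you will need to change the .text calls to .Text.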
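To make the quantifier rewriting easier to follow, here is a minimal sketch in plain JavaScript of the mapping that the greedy rule encodes: each quantifier becomes a minimum and a maximum, with 2147483647 standing in for "unbounded". The names quantifierBounds and UNBOUNDED are made up for this illustration; they are not part of the grammar or of the generated parser.

```javascript
// Upper bound the grammar's rewrite rules use to mean "no limit".
var UNBOUNDED = 2147483647;

// Hypothetical helper (not generated by ANTLR): maps the text of a greedy
// quantifier to the [min, max] pair that the `greedy` rule rewrites it to.
function quantifierBounds(text) {
  if (text === "+") return [1, UNBOUNDED];
  if (text === "*") return [0, UNBOUNDED];
  if (text === "?") return [0, 1];
  // {n}, {n,} or {n,m}
  var m = /^\{(\d+)(?:(,)(\d+)?)?\}$/.exec(text);
  if (!m) throw new Error("not a quantifier: " + text);
  var min = parseInt(m[1], 10);
  if (!m[2]) return [min, min];                      // {n}
  if (m[3] === undefined) return [min, UNBOUNDED];   // {n,}
  return [min, parseInt(m[3], 10)];                  // {n,m}
}
```

For example, quantifierBounds("{2,}") gives [2, 2147483647], which matches the INT[$a.text] INT["2147483647"] rewrite in the grammar.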
You can test it with the following HTML file:
<html>
<head>
<script type="text/javascript" src="antlr3-all-min.js"></script>
<script type="text/javascript" src="PCRELexer.js"></script>
<script type="text/javascript" src="PCREParser.js"></script>
<style type="text/css">
#tree {
padding: 20px;
font-family: Monospace;
}
.leaf {
font-weight: bold;
font-size: 130%;
}
</style>
<script type="text/javascript">
function init() {
document.getElementById("parse").onclick = parseRegex;
}
function parseRegex() {
document.getElementById("tree").innerHTML = "";
var regex = document.getElementById("regex").value;
if(regex) {
var lexer = new PCRELexer(new org.antlr.runtime.ANTLRStringStream(regex));
var parser = new PCREParser(new org.antlr.runtime.CommonTokenStream(lexer));
var root = parser.parse().getTree();
printTree(root, 0);
}
else {
document.getElementById("regex").value = "enter a regex here first";
}
}
function printTree(root, indent) {
if(!root) return;
for(var i = 0; i < indent; i++) {
document.getElementById("tree").innerHTML += ". ";
}
var n = root.getChildCount();
if(n == 0) {
document.getElementById("tree").innerHTML += "<span class=\"leaf\">" + root + "</span><br />";
}
else {
document.getElementById("tree").innerHTML += root + "<br />";
}
for(i = 0; i < n; i++) {
printTree(root.getChild(i), indent + 1);
}
}
</script>
</head>
<body onload="init()">
<input id="regex" type="text" size="50" />
<button id="parse">parse</button>
<div id="tree"></div>
</body>
</html>
(I rarely use JavaScript, so don't mind the mess I posted above!)
If you now parse the regex:
[^-234-7]|(?=[ab\]@]++$).|^$|\1\.\(
with the help of the HTML file above, you will see the following printed to the screen:
REGEX
. |
. . |
. . . |
. . . . ATOMS
. . . . . ATOM
. . . . . . NEG_CHAR_CLASS
. . . . . . . -
. . . . . . . 2
. . . . . . . 3
. . . . . . . RANGE
. . . . . . . . 4
. . . . . . . . 7
. . . . ATOMS
. . . . . ATOM
. . . . . . POSITIVE_LOOK_AHEAD
. . . . . . . ATOMS
. . . . . . . . ATOM
. . . . . . . . . CHAR_CLASS
. . . . . . . . . . a
. . . . . . . . . . b
. . . . . . . . . . \]
. . . . . . . . . . @
. . . . . . . . . POSSESSIVE
. . . . . . . . . . 1
. . . . . . . . . . 2147483647
. . . . . . . . ATOM
. . . . . . . . . $
. . . . . ATOM
. . . . . . .
. . . ATOMS
. . . . ATOM
. . . . . ^
. . . . ATOM
. . . . . $
. . ATOMS
. . . ATOM
. . . . BACK_REFERENCE
. . . . . 1
. . . ATOM
. . . . \.
. . . ATOM
. . . . \(
Careful, I haven't properly tested the grammar! I'd appreciate it if you let me know about any bugs you find.
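Since the question mentions eventually building a walker, here is a rough sketch of one that does not depend on the DOM. It relies only on the three things the HTML page already uses on a tree node: getChildCount(), getChild(i), and string conversion. The node helper is a hypothetical stand-in for a real ANTLR tree so the snippet runs on its own, and the sample tree is hand-built to resemble what the grammar's rewrite rules would produce for a+; it was not produced by the parser.

```javascript
// Hypothetical stand-in for an ANTLR tree node. It exposes only what the
// HTML test page relies on: getChildCount(), getChild(i) and toString().
function node(text, children) {
  children = children || [];
  return {
    getChildCount: function () { return children.length; },
    getChild: function (i) { return children[i]; },
    toString: function () { return text; }
  };
}

// Walks the tree depth-first and collects one line per node,
// prefixed with ". " for every level of depth.
function treeToLines(root, indent, lines) {
  indent = indent || 0;
  lines = lines || [];
  if (!root) return lines;
  var prefix = "";
  for (var i = 0; i < indent; i++) prefix += ". ";
  lines.push(prefix + String(root));
  for (var c = 0; c < root.getChildCount(); c++) {
    treeToLines(root.getChild(c), indent + 1, lines);
  }
  return lines;
}

// Hand-built sample, shaped like the AST the rewrite rules describe for "a+".
var sample = node("REGEX", [
  node("ATOMS", [
    node("ATOM", [
      node("a"),
      node("GREEDY", [node("1"), node("2147483647")])
    ])
  ])
]);
var text = treeToLines(sample).join("\n");
```

Building the whole string first and assigning it once avoids appending to innerHTML for every node, and the same function can be reused outside a browser.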
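If you comment out the line language=JavaScript; (so that ANTLR falls back to its default Java target), regenerate the lexer and parser, and run the following class: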
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "[^-234-7]|(?=[ab\\]@]++$).|^$|\\1\\.\\(|\\Q*+[\\E";
PCRELexer lexer = new PCRELexer(new ANTLRStringStream(src));
PCREParser parser = new PCREParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
you will see the DOT output for the corresponding AST.
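If you would rather stay on the JavaScript side, a DOT description can also be produced by hand with a small walker. The sketch below is an assumption-laden illustration, not a port of DOTTreeGenerator: the toDot and node names are made up, the node helper merely mimics the getChildCount()/getChild(i) interface used in the HTML page, and the output layout will differ from what the Java class prints.

```javascript
// Hypothetical stand-in for an ANTLR tree node (same interface as used
// by printTree in the HTML page).
function node(text, children) {
  children = children || [];
  return {
    getChildCount: function () { return children.length; },
    getChild: function (i) { return children[i]; },
    toString: function () { return text; }
  };
}

// Hand-rolled DOT writer: one statement per node, one edge per parent/child.
function toDot(root) {
  var lines = ["digraph AST {"];
  var nextId = 0;
  function visit(n) {
    var id = "n" + (nextId++);
    // Escape backslashes and double quotes so labels such as \] stay valid.
    var label = String(n).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
    lines.push("  " + id + ' [label="' + label + '"];');
    for (var i = 0; i < n.getChildCount(); i++) {
      var childId = visit(n.getChild(i));
      lines.push("  " + id + " -> " + childId + ";");
    }
    return id;
  }
  visit(root);
  lines.push("}");
  return lines.join("\n");
}

// Hand-built two-branch sample tree, not parser output.
var dot = toDot(node("|", [node("a"), node("b")]));
```

The resulting text can be pasted into any Graphviz viewer to draw the tree.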