ANTLR4: Using non-ASCII characters in token rules

Time: 2015-01-24 14:25:38

Tags: unicode antlr token grammar antlr4

On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its code point in this manner:

'\uxxxx'

where xxxx is the hexadecimal value of the Unicode code point.

So I used that technique in the token rule for an ID token:

grammar ID;

id : ID EOF ;

ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;

When I try to parse this input:

Gŭnter

ANTLR throws an error saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the specified range.)

What am I doing wrong?

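3 answers:

Answer 0 (score: 9)

ANTLR is ready to accept 16-bit characters, but by default many locales read input as bytes (8 bits), so a multi-byte character such as ŭ reaches the lexer as several unrelated characters. You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through the alias/script grun, pass the argument -encoding utf-8 (or whichever encoding your file uses). If you look at the source code of that class, you will see the following mechanism: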

InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
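
To make the effect of a wrong encoding concrete, here is a minimal sketch that uses only the Java standard library, not ANTLR itself. It assumes the input file was saved as UTF-8 and compares decoding it as UTF-8 with decoding it as ISO-8859-1, which stands in for a single-byte default locale. The class and method names (EncodingDemo, roundTrip, matchesIdRanges) are made up for this illustration, and matchesIdRanges only mirrors the character ranges of the ID rule from the question; it is not how ANTLR matches tokens.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Simulates reading a UTF-8 file: encode to bytes, then decode with the given charset.
    static String roundTrip(String text, Charset readAs) {
        byte[] onDisk = text.getBytes(StandardCharsets.UTF_8);
        return new String(onDisk, readAs);
    }

    // Mirrors the ranges of the ID rule: a-z, A-Z, and U+0100 through U+017E.
    static boolean matchesIdRanges(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean inRange = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '\u0100' && c <= '\u017E');
            if (!inRange) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The input from the question, with U+016D as its second letter.
        String input = "G\u016Dnter";
        String asUtf8 = roundTrip(input, StandardCharsets.UTF_8);
        String asLatin1 = roundTrip(input, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.length() + " chars, in ID ranges: " + matchesIdRanges(asUtf8));
        System.out.println(asLatin1.length() + " chars, in ID ranges: " + matchesIdRanges(asLatin1));
    }
}
```

Decoded as UTF-8, the text keeps its six characters and every one falls inside the ID ranges. Decoded as ISO-8859-1, the two bytes of ŭ (0xC5 0xAD) become two separate characters, U+00C5 and U+00AD, neither of which is covered by the rule, which is consistent with the error in the question.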

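Answer 1 (score: 0)

For those who hit the same problem when using antlr4 from Java code: ANTLRInputStream is deprecated, so here is a working way to pass multi-char Unicode data from a String to the MyLexer lexer: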

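    import java.nio.CharBuffer;
    import org.antlr.v4.runtime.CodePointBuffer;
    import org.antlr.v4.runtime.CodePointCharStream;
    import org.antlr.v4.runtime.CommonTokenStream;

    String myString = "\u2013";

    // Wrap the UTF-16 chars, then convert them to a code point buffer for the lexer
    CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
    CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
    CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);

    MyLexer lexer = new MyLexer(cpcs);
    CommonTokenStream tokens = new CommonTokenStream(lexer);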
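
As background on why a code point buffer is involved at all, the following standard-library sketch shows the difference between Java chars (UTF-16 code units) and Unicode code points. It does not use ANTLR; the class name CodePointDemo and its helpers are hypothetical. The en dash from the snippet above is a single char, whereas a character outside the Basic Multilingual Plane, such as U+1F600 here, occupies two chars in a String but is still one code point.

```java
final class CodePointDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String enDash = "\u2013";        // BMP character: one char, one code point
        String emoji = "\uD83D\uDE00";   // U+1F600 as a surrogate pair: two chars, one code point

        System.out.println("en dash: " + charCount(enDash) + " char(s), "
                + codePointCount(enDash) + " code point(s)");
        System.out.println("U+1F600: " + charCount(emoji) + " char(s), "
                + codePointCount(emoji) + " code point(s)");
        System.out.println("code point value: U+"
                + Integer.toHexString(emoji.codePointAt(0)).toUpperCase());
    }
}
```

As far as I know, the code point based streams introduced in ANTLR 4.7 hand the lexer whole code points rather than individual UTF-16 chars, which matters only for input outside the BMP; for the ŭ in the question, either view gives one unit.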

Answer 2 (score: 0)

Grammar:

NAME:
   [A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
;
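
To get a feel for which strings this NAME rule accepts, here is a rough approximation using java.util.regex with the same character classes. This is only a sketch: a regex match over the whole string is not the same thing as ANTLR's lexer, which picks the longest matching prefix among all rules, and the class name NameRuleDemo is made up for the example.

```java
import java.util.regex.Pattern;

final class NameRuleDemo {

    // Same character classes as the NAME rule, expressed as a java.util.regex pattern.
    static final Pattern NAME = Pattern.compile("[A-Za-z][0-9A-Za-z\\u0080-\\uFFFF_]+");

    // True if the whole string has the shape described by the NAME rule.
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "G\u016Dnter",   // ASCII letter followed by letters, one of them non-ASCII
            "a_1",           // underscore and digit are allowed after the first letter
            "x",             // too short: the trailing + needs at least one more character
            "\u016Dber"      // first character must be an ASCII letter
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, accepted: " + isName(s));
        }
    }
}
```

Two properties of the rule show up here: a name needs at least two characters because of the trailing +, and non-ASCII characters are accepted only after an initial ASCII letter.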

Java:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;

import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;

final class RequirementParser {

   static SystemContext parse( String requirement ) {
      requirement = requirement.replaceAll( "\t", "   " );
      final CharStream     charStream = CharStreams.fromString( requirement );
      final StimulusLexer  lexer      = new StimulusLexer( charStream );
      final TokenStream    tokens     = new CommonTokenStream( lexer );
      final StimulusParser parser     = new StimulusParser( tokens );
      final SystemContext  system     = parser.system();
      if( parser.getNumberOfSyntaxErrors() > 0 ) {
         Debug.format( requirement );
      }
      return system;
   }

   private RequirementParser() {/**/}
}

Source:

Lexers and Unicode text