Parsing formulas with multiple locales using Antlr

Time: 2016-03-16 19:13:33

Tags: antlr antlr4cs

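I'm very new to Antlr, so forgive what may be a very simple question.

I'm creating a grammar that parses Excel-like formulas, and it needs to support multiple locales that differ in their list separator (, for en-US) and decimal separator (. for en-US). I would rather not choose between separate grammars based on the locale.

Can I modify or inherit from the CommonTokenStream class to accomplish this, or is there another way to do it? Examples would help.

I'm using the Antlr v4.5.0-alpha003 NuGet package in a VS2015 C# project.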

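2 Answers:

Answer 0: (score: 0)

What you could do is add a locale (or custom decimal-separator and grouping characters) to the lexer, and put semantic predicates in front of the relevant lexer rules. The predicates check the next character against the configured separator and grouping characters, so those tokens are matched dynamically.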
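
For reference, the locale-dependent characters such a lexer would be configured with can be looked up from the JDK. The following is a minimal sketch: the class and method names are made up for illustration, and as far as I know the JDK exposes no list separator, so that one would have to be supplied separately.

```java
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical helper (not part of ANTLR): looks up the locale-dependent
// characters a lexer could be parameterized with.
final class LocaleSeparators {

    // Decimal separator for the locale, e.g. '.' for en-US.
    static char decimalSeparator(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
    }

    // Grouping (thousands) separator for the locale, e.g. ',' for en-US.
    static char groupingSeparator(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getGroupingSeparator();
    }

    public static void main(String[] args) {
        for (Locale locale : new Locale[] { Locale.US, Locale.GERMANY }) {
            System.out.printf("%s: decimal='%c', grouping='%c'%n",
                    locale.toLanguageTag(),
                    decimalSeparator(locale),
                    groupingSeparator(locale));
        }
    }
}
```

For en-US and de-DE the two characters are swapped, which is the difference the predicates have to account for.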

I don't have ANTLR and C# running here, but the Java demo should be quite similar:

grammar LocaleDemo;

@lexer::header {
  import java.text.DecimalFormatSymbols;
  import java.util.Locale;
}

@lexer::members {

  private char decimalSeparator = '.';
  private char groupingSeparator = ',';

  public LocaleDemoLexer(CharStream input, Locale locale) {
    this(input);
    DecimalFormatSymbols dfs = new DecimalFormatSymbols(locale);
    this.decimalSeparator = dfs.getDecimalSeparator();
    this.groupingSeparator = dfs.getGroupingSeparator();
  }
}

parse
 : .*? EOF
 ;

NUMBER
 : D D? ( DG D D D )* ( DS D+ )?
 ;

OTHER
 : .
 ;

fragment D  : [0-9];
fragment DS : {_input.LA(1) == decimalSeparator}?  . ;
fragment DG : {_input.LA(1) == groupingSeparator}? . ;

To test the grammar above, run this class:

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;
import java.util.Locale;

public class Main {

    private static void tokenize(String input, Locale locale) {

        LocaleDemoLexer lexer = new LocaleDemoLexer(new ANTLRInputStream(input), locale);
        System.out.printf("\ninput='%s', locale=%s, tokens:\n", input, locale);

        for (Token t : lexer.getAllTokens()) {
            System.out.printf("  %-10s '%s'\n", LocaleDemoLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
        }
    }

    public static void main(String[] args) throws Exception {

        tokenize("1.23", Locale.ENGLISH);
        tokenize("1.23", Locale.GERMAN);

        tokenize("12.345.678,90", Locale.ENGLISH);
        tokenize("12.345.678,90", Locale.GERMAN);
    }
}

which prints:

input='1.23', locale=en, tokens:
  NUMBER     '1.23'

input='1.23', locale=de, tokens:
  NUMBER     '1'
  OTHER      '.'
  NUMBER     '23'

input='12.345.678,90', locale=en, tokens:
  NUMBER     '12.345'
  OTHER      '.'
  NUMBER     '67'
  NUMBER     '8'
  OTHER      ','
  NUMBER     '90'

input='12.345.678,90', locale=de, tokens:
  NUMBER     '12.345.678,90'
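
The output above follows from how the two predicates gate the separator characters. As a rough illustration only, the hand-written scanner below (plain Java, not ANTLR-generated code; all names are made up for this sketch) imitates the NUMBER rule with the separators passed in as parameters. It assumes the decimal and grouping separators are different characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written scanner, NOT ANTLR output. It imitates the lexer rule
//   NUMBER : D D? ( DG D D D )* ( DS D+ )? ;
// with the separators supplied by the caller, much like the predicates
// consult the lexer's decimalSeparator and groupingSeparator fields.
final class NumberScanSketch {

    private static boolean digitAt(String s, int i) {
        return i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9';
    }

    // Returns the end index of a NUMBER starting at 'start', or 'start' if none matches.
    static int matchNumber(String s, int start, char decimalSep, char groupingSep) {
        int i = start;
        if (!digitAt(s, i)) return start;   // D
        i++;
        if (digitAt(s, i)) i++;             // D?
        // ( DG D D D )* : a grouping separator only counts when three digits follow
        while (i < s.length() && s.charAt(i) == groupingSep
                && digitAt(s, i + 1) && digitAt(s, i + 2) && digitAt(s, i + 3)) {
            i += 4;
        }
        // ( DS D+ )? : a decimal separator only counts when at least one digit follows
        if (i < s.length() && s.charAt(i) == decimalSep && digitAt(s, i + 1)) {
            i++;
            while (digitAt(s, i)) i++;
        }
        return i;
    }

    // Splits the input into NUMBER tokens and single-character OTHER tokens.
    static List<String> tokenize(String s, char decimalSep, char groupingSep) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = matchNumber(s, i, decimalSep, groupingSep);
            if (end == i) end = i + 1;      // OTHER : . ;
            tokens.add(s.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // German-style separators: ',' decimal, '.' grouping
        System.out.println(tokenize("12.345.678,90", ',', '.'));
        // English-style separators: '.' decimal, ',' grouping
        System.out.println(tokenize("12.345.678,90", '.', ','));
    }
}
```

With German-style separators the whole input is one token; with English-style separators it breaks into the same six pieces as in the lexer output.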

Answer 1: (score: 0)

As a follow-up to Bart's answer, here is the grammar I created based on his suggestion:

grammar ExcelScript;



@lexer::header
{
using System;
using System.Globalization;
}

@lexer::members
{
    private Int32 listseparator = 44; // UTF16 value for comma
    private Int32 decimalseparator = 46; // UTF16 value for period

    /// <summary>
    /// Creates a new lexer object
    /// </summary>
    /// <param name="input">The input stream</param>
    /// <param name="locale">The locale to use in parsing numbers</param>
    /// <returns>A new lexer object</returns>
    public ExcelScriptLexer (ICharStream input, CultureInfo locale)
    : this(input)
    {
        this.listseparator = Convert.ToInt32(locale.TextInfo.ListSeparator[0]);
        this.decimalseparator = Convert.ToInt32(locale.NumberFormat.NumberDecimalSeparator[0]);

        // Special case: in 8 locales both the list separator and the decimal separator are ','.
        // Excel uses a semicolon as the list separator in that case, so we do too.
        if (this.listseparator == 44 && this.decimalseparator == 44)
            this.listseparator = 59; // UTF16 value for semicolon
    }
}


/*
 * Parser Rules
 */

formula
    :   numberLiteral
    |   Identifier
    |   '=' expression
    ;

expression
    :   primary                                         # PrimaryExpression
    |   Identifier arguments                            # FunctionCallExpression
    |   ('+' | '-') expression                          # UnarySignExpression
    |   expression ('*' | '/' | '%') expression         # MulDivModExpression
    |   expression ('+' | '-') expression               # AddSubExpression
    |   expression ('<=' | '>=' | '>' | '<') expression # CompareExpression
    |   expression ('=' | '<>') expression              # EqualCompareExpression
    ;

primary
    :   '(' expression ')'                              # ParenExpression
    |   literal                                         # LiteralExpression
    |   Identifier                                      # IdentifierExpression
    ;

literal
    :   numberLiteral                                   # NumberLiteralRule
    |   booleanLiteral                                  # BooleanLiteralRule
    ;

numberLiteral
    :   IntegerLiteral
    |   FloatingPointLiteral
    ;

booleanLiteral
    :   TrueKeyword
    |   FalseKeyword
    ;

arguments
    :   '(' expressionList? ')'
    ;

expressionList
    :   expression (ListSeparator expression)*
    ;

/*
 * Lexer Rules
 */

AddOperator :   '+' ;
SubOperator :   '-' ;
MulOperator :   '*' ;
DivOperator :   '/' ;
PowOperator :   '^' ;
EqOperator  :   '=' ;
NeqOperator :   '<>' ;
LeOperator  :   '<=' ;
GeOperator  :   '>=' ;
LtOperator  :   '<' ;
GtOperator  :   '>' ;

ListSeparator : {_input.La(1) == listseparator}? . ;
DecimalSeparator : {_input.La(1) == decimalseparator}? . ;

TrueKeyword :   [Tt][Rr][Uu][Ee] ;
FalseKeyword    :   [Ff][Aa][Ll][Ss][Ee] ;

Identifier
    :   Letter (Letter | Digit)*
    ;

fragment Letter
    :   [A-Z_a-z]
    ;

fragment Digit
    :   [0-9]
    ;

IntegerLiteral
    :   '0'
    |   [1-9] [0-9]*
    ;

FloatingPointLiteral
    :   [0-9]+ DecimalSeparator [0-9]* Exponent?
    |   DecimalSeparator [0-9]+ Exponent?
    |   [0-9]+ Exponent
    ;

fragment Exponent
    :   ('e' | 'E') ('+' | '-')? ('0'..'9')+
    ;

WhiteSpace
    :   [ \t]+ -> channel(HIDDEN)
    ;
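
The special case in the lexer constructor above (list separator and decimal separator both being a comma) can be isolated and checked on its own. The following is a small Java sketch of that one rule; the class and method names are hypothetical and are not part of ANTLR or the .NET API.

```java
// Hypothetical helper mirroring the special case in the lexer constructor:
// when both separators are ',', fall back to ';' as the list separator.
final class ListSeparatorFallback {

    static char resolve(char listSeparator, char decimalSeparator) {
        if (listSeparator == ',' && decimalSeparator == ',') {
            return ';'; // same fallback the grammar's constructor applies
        }
        return listSeparator;
    }

    public static void main(String[] args) {
        System.out.println(resolve(',', '.')); // no clash: list separator is kept
        System.out.println(resolve(',', ',')); // clash: falls back to ';'
    }
}
```

Keeping the rule in one small function makes it easy to verify without generating or running the lexer.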