Java Lexical Analyzer

Date: 2018-02-13 23:39:01

Tags: java regex tokenize

My Token class:

public class Token {
    public enum TokenType {
        RELATIONALOPERATOR("==|<>|<=|>=|>|<"),
        MULTIPLYINGOPERATOR("[*/]"),
        SIGNADDINGOP("[+-]"),
        LEFTPAREN("\\("),
        RIGHTPAREN("\\)"),
        COMMA(","),
        PEROID("\\."),
        ASSIGNMENTOP("="),
        SEMICOLON(";"),
        WHILE("while"),
        IF("if"),
        ELSE("else"),
        COMMENT("//"),
        PUBLIC("public"),
        PRIVATE("private"),
        PACKAGE("package"),
        IMPORT("import"),
        ENUM("enum"),
        CONSTANT("[0-9]*"),
        VARIABLE("[a-zA-Z][a-zA-Z0-9]*"),
        SKIP("[\\s+\\t]*"),
        INVALID(".*");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
        this.type = type;
        this.data = data;
    }

    @Override
    public String toString() {
        return String.format("[ %s, %s ]", type.name(), this.data);
    }
}

My lexer class:

public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();

    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values()) {
        // format everything to match |(?<EXAMPLE>[0-9]*)
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    }
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));
    // System.out.println(tokenPatternsBuffer.substring(1));

    // Object that finds matches of pattern tokenPatterns
    Matcher matcher = tokenPatterns.matcher(input);
    while (matcher.find()) {
        int i = 0;
        System.out.println(matcher.group());
        for (TokenType tk : TokenType.values()) {
            // don't want to grab spaces
            if (matcher.group(TokenType.SKIP.toString()) != null) {
                continue;
            }
            // grab anything that isn't a space and add TokenType to the
            // matcher group using .named() because the text matching
            // exactly is vital
            else if (matcher.group(tk.name()) != null) {
                tokens.add(new Token(tk, matcher.group(tk.name())));
                i++;
                continue;
            }
        }
    }

    return tokens;
}

static String readFile(String path) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded);
}

public static void main(String[] args) throws IOException {
    String toDebugg = readFile("Test.java");
    ArrayList<Token> myTokens = lex(toDebugg);
    // System.out.println(toDebugg);
    // for (Token tok : myTokens) {
    //     System.out.println(tok);
    // }
}

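I have narrowed the problem down to the while (matcher.find()) loop. Printing matcher.group() inside the loop produces no visible output. When I run the code in the main method above, I get one [PACKAGE, package] printed, followed by 40 copies of "[CONSTANT, ]", whereas the next token matched should be "[VARIABLE, myLex]" (the package name). The file I am testing my lexer on is called Test.java, and its contents are: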

package myLex;
public class Test {

   public static void main(String[] args) {

   }

}

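I expect it to print output of the form [TYPENAMEHERE, actualRegexMatchHere], such as [PUBLIC, public], [CLASS, class], and so on, until it has read the whole file. Thanks to Erwin's suggestion the special characters are now escaped, but the problem persists.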

1 answer:

Answer 0 (score: 0)

Originally my regular expressions were defined as
     CONSTANT("[0-9]*"), VARIABLE("[a-zA-Z][a-zA-Z0-9]*"), SKIP("[\\s+\\t]*"), INVALID(".*");

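Because of the star quantifier, [0-9]* can match the empty string, so it succeeds at every position in the input. Replacing every * with + in the regular expressions fixed the problem.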
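
To make the effect concrete, here is a minimal sketch, not the original code. The class name LexerSketch, the helper countEmptyMatches, and the trimmed-down token set are all made up for illustration. It counts the zero-length matches that [0-9]* produces on input containing no digits, then lexes the first line of Test.java with CONSTANT changed to [0-9]+ and SKIP simplified to \s+. VARIABLE keeps its * in this sketch, on the assumption that the mandatory leading [a-zA-Z] already rules out an empty match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerSketch {
    // Hypothetical, reduced token set; order matters because the first
    // alternative that matches at a position wins.
    enum TokenType {
        PACKAGE("package"),
        SEMICOLON(";"),
        CONSTANT("[0-9]+"),               // '+' requires at least one digit
        VARIABLE("[a-zA-Z][a-zA-Z0-9]*"), // cannot be empty: first class is mandatory
        SKIP("\\s+"),
        INVALID(".");

        final String pattern;

        TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    // Counts how many matches of the regex on the input are zero-length.
    static int countEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int empty = 0;
        while (m.find()) {
            if (m.group().isEmpty()) {
                empty++;
            }
        }
        return empty;
    }

    // Builds one alternation of named groups and returns tokens as strings.
    static List<String> lex(String input) {
        StringBuilder sb = new StringBuilder();
        for (TokenType t : TokenType.values()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(String.format("(?<%s>%s)", t.name(), t.pattern));
        }
        Matcher m = Pattern.compile(sb.toString()).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            if (m.group(TokenType.SKIP.name()) != null) {
                continue; // whitespace is dropped
            }
            for (TokenType t : TokenType.values()) {
                String text = m.group(t.name());
                if (text != null) {
                    tokens.add(String.format("[ %s, %s ]", t.name(), text));
                    break; // only one named group matches per find()
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(countEmptyMatches("[0-9]*", "ab")); // prints 3
        System.out.println(countEmptyMatches("[0-9]+", "ab")); // prints 0
        // prints [[ PACKAGE, package ], [ VARIABLE, myLex ], [ SEMICOLON, ; ]]
        System.out.println(lex("package myLex;"));
    }
}
```

With the original patterns, CONSTANT comes before VARIABLE and SKIP in the alternation, so its empty match wins at every position where no earlier alternative matches. That would explain both the blank lines printed by matcher.group() and the run of [CONSTANT, ] tokens.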