My Token class:
public class Token {
    public enum TokenType {
        RELATIONALOPERATOR("==|<>|<=|>=|>|<"), MULTIPLYINGOPERATOR("[*/]"), SIGNADDINGOP("[+-]"),
        LEFTPAREN("\\("), RIGHTPAREN("\\)"), COMMA(","), PEROID("\\."), ASSIGNMENTOP("="), SEMICOLON(";"),
        WHILE("while"), IF("if"), ELSE("else"), COMMENT("//"), PUBLIC("public"), PRIVATE("private"),
        PACKAGE("package"), IMPORT("import"), ENUM("enum"),
        CONSTANT("[0-9]*"), VARIABLE("[a-zA-Z][a-zA-Z0-9]*"), SKIP("[\\s+\\t]*"), INVALID(".*");

        public final String pattern;

        private TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    public TokenType type;
    public String data;

    public Token(TokenType type, String data) {
        this.type = type;
        this.data = data;
    }

    @Override
    public String toString() {
        return String.format("[ %s, %s ]", type.name(), this.data);
    }
}
My lexer class:
public static ArrayList<Token> lex(String input) {
    // The tokens to return
    ArrayList<Token> tokens = new ArrayList<Token>();
    StringBuffer tokenPatternsBuffer = new StringBuffer();
    for (TokenType tokenType : TokenType.values()) {
        // format everything to match |(?<EXAMPLE> [0-9]*)
        tokenPatternsBuffer.append(String.format("|(?<%s>%s)", tokenType.name(), tokenType.pattern));
    }
    Pattern tokenPatterns = Pattern.compile(tokenPatternsBuffer.substring(1));
    //System.out.println(tokenPatternsBuffer.substring(1));
    // Object that finds matches of pattern tokenPatterns
    Matcher matcher = tokenPatterns.matcher(input);
    while (matcher.find()) {
        int i = 0;
        System.out.println(matcher.group());
        for (TokenType tk : TokenType.values()) {
            // don't want to grab spaces
            if (matcher.group(TokenType.SKIP.toString()) != null) {
                continue;
            }
            // grab anything that isn't a space and add TokenType to the
            // matcher group using .named() because the text matching
            // exactly is vital
            else if (matcher.group(tk.name()) != null) {
                tokens.add(new Token(tk, matcher.group(tk.name())));
                i++;
                continue;
            }
        }
    }
    return tokens;
}

static String readFile(String path) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded);
}

public static void main(String[] args) throws IOException {
    String toDebugg = readFile("Test.java");
    ArrayList<Token> myTokens = lex(toDebugg);
    // System.out.println(toDebugg);
    // for (Token tok : myTokens) {
    //     System.out.println(tok);
    // }
}
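I have narrowed the problem down to the while (matcher.find()) loop. Printing matcher.group() inside the loop produces no visible output. When I run the code in the main method above, I get one printout of [PACKAGE,package], followed by 40 printouts of "[CONSTANT,]", whereas the next token matched should be "[Variable,myLex]" (the package name). The file I am testing my lexer on is called Test.java, and its contents are: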
package myLex;

public class Test {
    public static void main(String[] args) {
    }
}
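I expect it to print output of the form [TYPENAMEHERE,actualRegexMatchHere], such as [PUBLIC,public], [CLASS,class], and so on, until it has read the whole file. Thanks to Erwin's suggestion, the special characters are now escaped, but the problem persists.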
Answer 0 (score: 0)
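Originally my regexes were defined as
CONSTANT("[0-9]*"), VARIABLE("[a-zA-Z][a-zA-Z0-9]*"), SKIP("[\\s+\\t]*"), INVALID(".*");
The pattern [0-9]* can match the empty string, so it matches at every position in the input, including positions where no digit is present. Because CONSTANT comes before VARIABLE in the combined alternation, that empty match wins before VARIABLE is ever tried. Replacing every * with + in the regexes fixed the problem.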
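The effect described above can be reproduced in isolation. The following is a minimal sketch, not the original lexer: the class name LexerSketch, the helper countEmptyMatches, and the reduced TokenType enum are hypothetical names introduced only for illustration. It assumes CONSTANT, SKIP and INVALID use + as the answer describes (with SKIP simplified to \\s+), and it keeps the * inside VARIABLE because the leading [a-zA-Z] already prevents an empty match there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerSketch {
    // Hypothetical, reduced subset of the question's TokenType enum.
    // Assumption: every pattern here must consume at least one character.
    enum TokenType {
        PACKAGE("package"), SEMICOLON(";"),
        CONSTANT("[0-9]+"),                 // was "[0-9]*", which matches the empty string
        VARIABLE("[a-zA-Z][a-zA-Z0-9]*"),   // first character is mandatory, so never empty
        SKIP("\\s+"),                       // simplified from "[\\s+\\t]*"
        INVALID(".+");                      // was ".*"

        final String pattern;

        TokenType(String pattern) {
            this.pattern = pattern;
        }
    }

    // Counts how many matches of regex in input are zero-length.
    static int countEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            if (m.group().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Tokenizes input with one named group per token type, dropping SKIP matches.
    static List<String> lex(String input) {
        StringBuilder combined = new StringBuilder();
        for (TokenType t : TokenType.values()) {
            if (combined.length() > 0) {
                combined.append('|');
            }
            combined.append(String.format("(?<%s>%s)", t.name(), t.pattern));
        }
        Matcher m = Pattern.compile(combined.toString()).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            for (TokenType t : TokenType.values()) {
                String text = m.group(t.name());
                if (text == null) {
                    continue; // this alternative did not produce the match
                }
                if (t != TokenType.SKIP) {
                    tokens.add(String.format("[ %s, %s ]", t.name(), text));
                }
                break; // exactly one alternative matches per find()
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // One empty match before each of the 5 letters plus one at the end.
        System.out.println(countEmptyMatches("[0-9]*", "myLex")); // prints 6
        System.out.println(countEmptyMatches("[0-9]+", "myLex")); // prints 0
        // prints [[ PACKAGE, package ], [ VARIABLE, myLex ], [ SEMICOLON, ; ]]
        System.out.println(lex("package myLex;"));
    }
}
```

With the original [0-9]* in place of [0-9]+, the same loop would report an empty CONSTANT token at each position of myLex instead of a single VARIABLE token, which is consistent with the repeated "[CONSTANT,]" output in the question.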