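I have appropriated and modified the code below, which does a good job of tokenizing Java code with Java's StreamTokenizer. Its number handling is problematic, however.
I would be quite happy to turn off StreamTokenizer's number parsing entirely and parse the numbers myself from word tokens, but commenting out st.parseNumbers() seems to have no effect.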
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Token and its subclasses (IntegerToken, WordToken, ...) are the asker's own classes, not shown here.
public class JavaTokenizer {
    private String code;
    private List<Token> tokens;

    public JavaTokenizer(String c) {
        code = c;
        tokens = new ArrayList<>();
    }

    public void tokenize() {
        try {
            // Create the tokenizer
            StringReader sr = new StringReader(code);
            StreamTokenizer st = new StreamTokenizer(sr);
            // Java-style tokenizing rules
            st.parseNumbers();
            st.wordChars('_', '_');
            st.eolIsSignificant(false);
            // Don't want whitespace tokens
            //st.ordinaryChars(0, ' ');
            // Strip out comments
            st.slashSlashComments(true);
            st.slashStarComments(true);
            // Parse the file
            int token;
            do {
                token = st.nextToken();
                switch (token) {
                case StreamTokenizer.TT_NUMBER:
                    // A number was found; the value is in nval
                    double num = st.nval;
                    if (num % 1 == 0)
                        tokens.add(new IntegerToken((int) num));
                    else
                        tokens.add(new FPNumberToken(num));
                    break;
                case StreamTokenizer.TT_WORD:
                    // A word was found; the value is in sval
                    String word = st.sval;
                    tokens.add(new WordToken(word));
                    break;
                case '"':
                    // A double-quoted string was found; sval contains the contents
                    String dquoteVal = st.sval;
                    tokens.add(new DoubleQuotedStringToken(dquoteVal));
                    break;
                case '\'':
                    // A single-quoted string was found; sval contains the contents
                    String squoteVal = st.sval;
                    tokens.add(new SingleQuotedStringToken(squoteVal));
                    break;
                case StreamTokenizer.TT_EOL:
                    // End of line character found
                    tokens.add(new EOLToken());
                    break;
                case StreamTokenizer.TT_EOF:
                    // End of file has been reached
                    tokens.add(new EOFToken());
                    break;
                default:
                    // A regular character was found; the value is the token itself
                    char ch = (char) st.ttype;
                    if (Character.isWhitespace(ch))
                        tokens.add(new WhitespaceToken(ch));
                    else
                        tokens.add(new SymbolToken(ch));
                    break;
                }
            } while (token != StreamTokenizer.TT_EOF);
            sr.close();
        } catch (IOException e) {
            // Not expected when reading from a StringReader; rethrow rather than swallow silently
            throw new UncheckedIOException(e);
        }
    }

    public List<Token> getTokens() {
        return tokens;
    }
}
Answer 0 (score: 1)
parseNumbers() is "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.
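As a rough sketch of that suggestion: the class below calls resetSyntax() and then re-enables only word characters, whitespace, quotes and comments. The class name NoNumberTokenizerDemo and the choice to return tokens as plain strings are assumptions made for illustration, not part of the question's code. Digits are registered as word characters, so a literal such as 0.5 comes back as three tokens ("0", ".", "5") that would still have to be reassembled by hand.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: StreamTokenizer with number parsing switched off.
class NoNumberTokenizerDemo {
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        st.resetSyntax();            // every character is now "ordinary"; number parsing is off
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');      // digits are plain word characters here
        st.wordChars('_', '_');
        st.wordChars('$', '$');
        st.whitespaceChars(0, ' ');
        st.quoteChar('"');
        st.quoteChar('\'');
        st.slashSlashComments(true);
        st.slashStarComments(true);

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);                       // identifier or digit run
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    char q = (char) st.ttype;
                    out.add(q + st.sval + q);               // quoted literal, contents in sval
                } else {
                    out.add(String.valueOf((char) st.ttype)); // single symbol such as '-' or '.'
                }
            }
        } catch (IOException e) {
            // Not expected for a StringReader
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Calling NoNumberTokenizerDemo.tokenize("a - 12 . b") yields the minus sign and the period as ordinary symbol tokens and 12 as a word token, which is the starting point for parsing numbers yourself.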
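That said, parsing numbers by hand can get tricky once you account for decimal points and exponents. With a Scanner and regular expressions, implementing your own tokenizer, tailored exactly to your needs, should be relatively straightforward. For an example, you may want to look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (roughly the last 120 LOC).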
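To give an idea of what a hand-rolled tokenizer could look like, here is a minimal sketch. It is not the linked Tokenizer class, and it uses java.util.regex.Matcher directly instead of Scanner. The class name RegexTokenizerSketch, the "kind:text" output format and the exact number pattern are all assumptions for illustration. Comments, string literals and multi-character operators are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex-driven tokenizer; not the linked implementation.
class RegexTokenizerSketch {
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+"                                                      // whitespace, skipped
        + "|(?<num>(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?)"     // digits, optional fraction and exponent
        + "|(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"                      // identifier or keyword
        + "|(?<sym>.)",                                             // any other single character
        Pattern.DOTALL);

    static List<String> tokenize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            if (m.group("num") != null) {
                out.add("num:" + m.group("num"));
            } else if (m.group("word") != null) {
                out.add("word:" + m.group("word"));
            } else if (m.group("sym") != null) {
                out.add("sym:" + m.group("sym"));
            }
            // Otherwise the match was whitespace and nothing is emitted.
        }
        return out;
    }
}
```

Because the number pattern never includes a leading sign and requires a digit after a leading period, a lone "-" or "." falls through to the symbol branch instead of being turned into a number.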
Answer 1 (score: 0)
I'll look into this when I get a chance. In the meantime, the disgusting workaround I implemented is to declare a placeholder constant:
private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";
Then, in tokenize(), before the tokenizer is created:
// A period that follows whitespace and is not followed by a digit is a "dangling period"
code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");
In the tokenization loop, a word token equal to DANGLING_PERIOD_TOKEN then stands for the original period.
This solution is specific to my needs, where I don't care about the original whitespace (since it adds some around the inserted "token").
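The effect of the replacement can be checked in isolation. The sketch below wraps the same regular expression in a helper; the class name DanglingPeriodDemo and the method names markDanglingPeriods and restore are hypothetical, and restore only illustrates one way the placeholder might be mapped back inside the tokenization loop.

```java
// Hypothetical helper class showing the placeholder substitution on its own.
class DanglingPeriodDemo {
    static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";

    // Replaces each period that follows whitespace and is not followed by a digit.
    static String markDanglingPeriods(String code) {
        return code.replaceAll("(?<=\\s)\\.(?![0-9])", " " + DANGLING_PERIOD_TOKEN + " ");
    }

    // Assumed counterpart for the TT_WORD branch: turn the placeholder back into ".".
    static String restore(String word) {
        return DANGLING_PERIOD_TOKEN.equals(word) ? "." : word;
    }
}
```

A period directly after an identifier (foo.bar) or in front of a digit (.5) is left alone, so only the whitespace-preceded case is rewritten, at the cost of the extra spaces mentioned above.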