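StreamTokenizer mangles integers and loose periods

Time: 2017-04-19 17:40:35

Tags: java tokenize

I have appropriated and modified the code below, which does a nice job of tokenizing Java code with Java's StreamTokenizer. Its number handling is problematic, though: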

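  1. It converts all integers to doubles. I can work around that by testing num % 1 == 0, but it feels like a hack.
  2. More critically, a . that follows whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("] and [Symbol ")"].
  3. I'd be happy to turn off StreamTokenizer's number parsing entirely and parse numbers out of the word tokens myself, but commenting out st.parseNumbers() seems to have no effect.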
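
Points 1 and 2 can be reproduced in isolation. The following is a minimal sketch, assuming a StreamTokenizer with its default syntax table (none of the customizations from the code below, so whitespace is skipped rather than reported); the class name `DefaultSyntaxRepro` and the token descriptions are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DefaultSyntaxRepro {
    // Tokenizes with StreamTokenizer's default syntax table and describes each token.
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    // Every number arrives as a double in nval
                    out.add("Number(" + st.nval + ")");
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("Word(" + st.sval + ")");
                } else {
                    out.add("Symbol(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // The lone '.' is reported as the number 0.0
        System.out.println(tokenize("Class .method()"));
        // The integer 42 is reported as the double 42.0
        System.out.println(tokenize("x = 42"));
    }
}
```

The period is affected because the default syntax table marks '.' (along with the digits and '-') as a numeric character, so a '.' that starts a token is read as a number with no digits.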

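    import java.io.IOException;
    import java.io.StreamTokenizer;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;
    
    // Token and its subclasses (IntegerToken, WordToken, ...) are my own classes, not shown here
    public class JavaTokenizer {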
    
    private String code;
    
    private List<Token> tokens;
    
    public JavaTokenizer(String c) {
        code = c;
        tokens = new ArrayList<>();
    }
    
    public void tokenize() {
        try {
            // Create the tokenizer
            StringReader sr = new StringReader(code);
            StreamTokenizer st = new StreamTokenizer(sr);
    
            // Java-style tokenizing rules
            st.parseNumbers();
            st.wordChars('_', '_');
            st.eolIsSignificant(false);
    
            // Don't want whitespace tokens
            //st.ordinaryChars(0, ' ');
    
            // Strip out comments
            st.slashSlashComments(true);
            st.slashStarComments(true);
    
            // Parse the file
            int token;
            do {
                token = st.nextToken();
                switch (token) {
                case StreamTokenizer.TT_NUMBER:
                    // A number was found; the value is in nval
                    double num = st.nval;
                    if(num % 1 == 0)
                      tokens.add(new IntegerToken((int)num));
                    else
                      tokens.add(new FPNumberToken(num));
                    break;
                case StreamTokenizer.TT_WORD:
                    // A word was found; the value is in sval
                    String word = st.sval;
                    tokens.add(new WordToken(word));
                    break;
                case '"':
                    // A double-quoted string was found; sval contains the contents
                    String dquoteVal = st.sval;
                    tokens.add(new DoubleQuotedStringToken(dquoteVal));
                    break;
                case '\'':
                    // A single-quoted string was found; sval contains the contents
                    String squoteVal = st.sval;
                    tokens.add(new SingleQuotedStringToken(squoteVal));
                    break;
                case StreamTokenizer.TT_EOL:
                    // End of line character found
                    tokens.add(new EOLToken());
                    break;
                case StreamTokenizer.TT_EOF:
                    // End of file has been reached
                    tokens.add(new EOFToken());
                    break;
                default:
                    // A regular character was found; the value is the token itself
                    char ch = (char) st.ttype;
                    if(Character.isWhitespace(ch))
                        tokens.add(new WhitespaceToken(ch));
                    else
                        tokens.add(new SymbolToken(ch));
                    break;
                }
            } while (token != StreamTokenizer.TT_EOF);
            sr.close();
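        } catch (IOException e) {
            // A StringReader should not fail here, but don't swallow the error silently
            throw new java.io.UncheckedIOException(e);
        }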
    }
    
    public List<Token> getTokens() {
        return tokens;
    }
    
    }
    

2 answers:

Answer 0 (score: 1)

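parseNumbers() is "on" by default, which is why removing the call changes nothing. Use resetSyntax() to turn off number parsing and all the other predefined character types, then enable just what you need.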
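
A minimal sketch of that approach, assuming you want identifiers, quoted strings and comment stripping back after the reset. The class name `ResetSyntaxSketch` and the token descriptions are made up for illustration; the StreamTokenizer calls themselves are the standard ones.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ResetSyntaxSketch {
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));

        // Make every character "ordinary"; this also switches number parsing off
        st.resetSyntax();

        // Re-enable only what is needed
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('_', '_');
        st.wordChars('0', '9');      // digits now arrive inside word tokens
        st.whitespaceChars(0, ' ');  // skip whitespace and control characters
        st.quoteChar('"');
        st.quoteChar('\'');
        st.slashSlashComments(true);
        st.slashStarComments(true);

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("Word(" + st.sval + ")");
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    out.add("String(" + st.sval + ")");
                } else {
                    out.add("Symbol(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Class .method()"));
        System.out.println(tokenize("x = 42; // answer"));
    }
}
```

With this setup a literal such as 3.14 comes out as three tokens (word "3", symbol '.', word "14"), so reassembling floating-point numbers is left to the caller.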

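That said, manual number parsing can get tricky because you have to account for decimal points and exponents. With a scanner and regular expressions, implementing your own tokenizer, tailored exactly to your needs, should be relatively straightforward. For example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (the last ~120 LOC)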
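
To give a rough idea of the regular-expression route, here is a minimal sketch using java.util.regex. It is not the Tokenizer class from the linked project; the class name `RegexTokenizerSketch`, the group names and the token descriptions are assumptions made for illustration, and strings and comments are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizerSketch {
    // Alternatives are tried in order: floating point, integer, identifier, any other symbol
    private static final Pattern TOKEN = Pattern.compile(
          "(?<fp>\\d+\\.\\d*(?:[eE][+-]?\\d+)?|\\.\\d+(?:[eE][+-]?\\d+)?|\\d+[eE][+-]?\\d+)"
        + "|(?<integer>\\d+)"
        + "|(?<word>[A-Za-z_$][A-Za-z0-9_$]*)"
        + "|(?<sym>\\S)");

    static List<String> tokenize(String code) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        // find() skips whitespace, because no alternative can start on it
        while (m.find()) {
            if (m.group("fp") != null) {
                out.add("Float(" + m.group("fp") + ")");
            } else if (m.group("integer") != null) {
                out.add("Integer(" + m.group("integer") + ")");
            } else if (m.group("word") != null) {
                out.add("Word(" + m.group("word") + ")");
            } else {
                out.add("Symbol(" + m.group("sym") + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Class .method()"));
        System.out.println(tokenize("n = 42 + 3.5e2"));
    }
}
```

Because integers and floating-point literals are separate alternatives, there is no need for a num % 1 == 0 check, and a '.' that is not followed by a digit falls through to the symbol case.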

Answer 1 (score: 0)

I'll look into that when I get a chance. In the meantime, the disgusting workaround I implemented starts with a placeholder constant:

    private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";

Then in tokenize():

    //a period following whitespace, not followed by a digit is a "dangling period"
    code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " "+DANGLING_PERIOD_TOKEN+" ");

In the tokenization loop, a word token equal to DANGLING_PERIOD_TOKEN is then treated as the period it stands for.

This solution is specific to my needs: I don't care about the original whitespace (it adds some around the inserted "token").
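
Put together, the workaround might look roughly like the sketch below. The class name `DanglingPeriodSketch`, the token descriptions and the check inside the loop are assumptions made for illustration; only the constant and the replaceAll() call come from the snippets above.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DanglingPeriodSketch {
    private static final String DANGLING_PERIOD_TOKEN = "___DANGLING_PERIOD_TOKEN___";

    static List<String> tokenize(String code) {
        // A period following whitespace, not followed by a digit, becomes a placeholder word
        code = code.replaceAll("(?<=\\s)\\.(?![0-9])", " " + DANGLING_PERIOD_TOKEN + " ");

        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        st.wordChars('_', '_'); // required so the placeholder is read as a single word

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    if (st.sval.equals(DANGLING_PERIOD_TOKEN)) {
                        out.add("Symbol(.)"); // map the placeholder back to a period (assumed step)
                    } else {
                        out.add("Word(" + st.sval + ")");
                    }
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("Number(" + st.nval + ")");
                } else {
                    out.add("Symbol(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Class .method()"));
        System.out.println(tokenize("x = .5"));
    }
}
```

Note that the lookbehind requires whitespace before the period, so a period at the very start of the input is not replaced.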