Question

I'm building a code interpreter in C++, and while working out the token logic I ran into an unexpected problem.

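The user types a string into the console, and the program parses that string into objects of type Token. The problem is the way I'm doing it, which is as follows: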

void splitLine(string aLine) {

    stringstream ss(aLine);
    string stringToken, outp;
    char delim = ' ';

    // Break input string aLine into tokens and store them in R_Tokens
    while (getline(ss, stringToken, delim)) {

        // assign the value of the parsed stringToken to t; this labels invalid tokens
        Token t (readToken(stringToken));

        R_Tokens.push_back(t);
    }   
}

The problem here is that if the parser receives a string such as "Hello World!", it splits it into 2 tokens, "Hello and World!"

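The main goal is for the code to recognize a double quote as the start of a string token and to store the whole thing (from " to ") as a single token. So if I enter x = "hello world", it should store x as a token, then = as a token on the next pass, and then hello world as one token instead of splitting it.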

Answer 1

You can use the C++14 std::quoted manipulator.

#include <string>
#include <sstream>
#include <iomanip>

#include <iostream>

void splitLine(std::string aLine) {

    std::istringstream iss(aLine);
    std::string stringToken;

    // Break input string aLine into tokens and store them in rTokenBag
    while(iss >> std::quoted(stringToken)) {
        std::cout << stringToken << "\n";
    }
}

int main() {

    splitLine("Heloo world \"single token\" new tokens");
}

Answer 2

You really don't want to tokenize a programming language by splitting at a delimiter.

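A proper tokenizer switches on the first character to decide what kind of token to read, keeps reading as long as it finds characters that fit that token type, and emits the token when it finds the first character that doesn't match (which then becomes the starting point for the next token).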

This might look something like this (assuming it is an istreambuf_iterator or some other iterator that walks over the input character by character):

Token Tokenizer::next_token() {
    if (isalpha(*it)) {
        return read_identifier();
    } else if(isdigit(*it)) {
        return read_number();
    } else if(*it == '"') {
        return read_string();
    } /* ... */
}

Token Tokenizer::read_string() {
    // This should only be called when the current character is a "
    assert(*it == '"');
    it++;
    string contents;
    while(*it != '"') {
        contents.push_back(*it);
        it++;
    }
    // Skip the closing quote so the next token does not start on it
    it++;
    return Token(TokenKind::StringToken, contents);
}

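What this doesn't handle are escape sequences, or the case where we reach the end of the file without ever seeing the second ", but it should give you the basic idea.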
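To cover the two gaps just mentioned, here is a minimal sketch of a string reader that handles a few escape sequences and reports an unterminated literal. The name read_string_literal, the use of std::optional (C++17) to signal failure, and the chosen set of escapes are assumptions made for illustration; they are not part of the code above.

```cpp
#include <optional>
#include <string>

// Hypothetical helper: reads a string literal starting at the opening quote.
// Returns the contents without the surrounding quotes, or std::nullopt if the
// input ends before the closing quote. On success, `it` is left just past the
// closing quote.
template <typename Iter>
std::optional<std::string> read_string_literal(Iter& it, Iter end) {
    if (it == end || *it != '"') return std::nullopt;
    ++it;  // skip the opening quote
    std::string contents;
    while (it != end) {
        char c = *it;
        ++it;
        if (c == '"') return contents;  // closing quote: done
        if (c == '\\') {
            if (it == end) return std::nullopt;  // backslash at end of input
            char e = *it;
            ++it;
            switch (e) {
                case 'n': contents.push_back('\n'); break;
                case 't': contents.push_back('\t'); break;
                // Assumption: any other escaped character (including \" and \\)
                // simply stands for itself.
                default:  contents.push_back(e); break;
            }
        } else {
            contents.push_back(c);
        }
    }
    return std::nullopt;  // reached the end of input without a closing quote
}
```

Because it is a template over the iterator type, the same function should work with an istreambuf_iterator (whose default-constructed value marks the end) or with plain std::string iterators. How the caller turns a std::nullopt into an error token or a diagnostic is left open here.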

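Something like std::quoted might solve your immediate problem with string literals, but it won't help you if you want x="hello world" to be tokenized the same way as x = "hello world" (which you almost certainly do).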
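To make that limitation concrete, the sketch below wraps the std::quoted loop in a function that collects the tokens instead of printing them. The name splitWithQuoted is made up for this example.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Splits a line with std::quoted and returns the tokens.
// std::quoted only treats a token as quoted when the quote is the first
// non-whitespace character it sees; otherwise it falls back to a plain
// whitespace-delimited read.
std::vector<std::string> splitWithQuoted(const std::string& line) {
    std::istringstream iss(line);
    std::vector<std::string> tokens;
    std::string token;
    while (iss >> std::quoted(token)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

With spaces around the operator, splitWithQuoted("x = \"hello world\"") yields three tokens: x, = and hello world. Without them, splitWithQuoted("x=\"hello world\"") yields two: x="hello and world", because the quote appears in the middle of a word and is never seen as the start of a quoted string.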

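PS: You can also read the whole source into memory first and then have your tokens hold indices or pointers into the source rather than strings (so instead of the contents variable, you would just save the start index before the loop and then return Token(TokenKind::StringToken, start_index, current_index)). Which one is better depends partly on what you do in the parser. If your parser produces results directly and you don't need to keep the tokens around after processing them, the first version is more memory-efficient, because you never need to hold the whole source in memory. If you build an AST, memory consumption will be roughly the same either way, but the second version lets you have one big string instead of many small ones.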
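A rough sketch of that index-based variant follows. The SpanToken struct and the functions read_string_span and text_of are hypothetical names for this example, it assumes the whole source is already in a std::string, and it uses std::string_view (C++17) to look at the text without copying it.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical token that refers to the source by position instead of
// copying the text.
struct SpanToken {
    std::size_t start;  // index of the first character of the contents
    std::size_t end;    // index one past the last character of the contents
};

// Reads a string literal whose opening quote is at source[pos].
// On return, pos is just past the closing quote, or at the end of the source
// if the literal is unterminated (this sketch does not report that case).
SpanToken read_string_span(const std::string& source, std::size_t& pos) {
    ++pos;  // skip the opening quote
    std::size_t start = pos;
    while (pos < source.size() && source[pos] != '"') ++pos;
    SpanToken tok{start, pos};
    if (pos < source.size()) ++pos;  // skip the closing quote
    return tok;
}

// Recovers the text only when it is actually needed. The view is valid only
// while `source` is alive and unmodified.
std::string_view text_of(const std::string& source, const SpanToken& tok) {
    return std::string_view(source).substr(tok.start, tok.end - tok.start);
}
```

For the source x = "hello world" y with pos at the opening quote (index 4), the returned span covers indices 5 to 16 and text_of gives hello world. The trade-off is the one described above: tokens become two integers each, but the source string has to outlive every token that points into it.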

Answer 3

So I finally figured it out: I can use getline() to achieve my goal.

This new code runs and parses the way I need it to:

void splitLine(string aLine) {

    stringstream ss(aLine);
    string stringToken, outp;
    char delim = ' ';

    // Break line into tokens and store them in R_Tokens
    while (getline(ss, stringToken, delim)) {

        // new code starts here
        // if the current substring opens a quoted string that is not already
        // closed within the same word (e.g. "hello" on its own)
        if (stringToken[0] == '"' && (stringToken.size() == 1 || stringToken.back() != '"')) {

            string torzen;
            // read the rest of ss up to the next double quote
            getline(ss, torzen, '"');

            // Give back the space cut by the initial getline(), add the substring
            // parsed by the second getline(), and add the closing double quote
            // that the second getline() consumed
            stringToken += ' ' + torzen + '"';
        }

        // And we can all continue with our lives
        // assign the value of the parsed stringToken to t; this labels invalid tokens
        Token t (readToken(stringToken));

        R_Tokens.push_back(t);
    }
}

Thanks to everyone who answered and commented, your help was great!

Token parsing problem during interpreter development
