What is the best way to write a syntax tokenizer/parser in C?

Time: 2016-11-11 00:33:29

Tags: c string parsing tokenize

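Background:
I want to write a programming language. I know there are tools for this, but I don't have any good examples of how to use them. I really don't want to use Flex or Bison, because they don't teach the level of abstraction I think is needed to build a compiler. I have the concept of creating strings, tokenizing them, feeding them to a file that acts as the grammar, and parsing them, eventually creating an actual program that runs the language. The problem is that I don't know how to write the tokenizer or the parser. I have a general idea, but it would become much clearer if I saw examples. If someone could post one or a few examples, that would be great!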

My question is as follows:
Can someone post an example of how to write a syntax tokenizer/parser in C?

1 Answer:

Answer 0 (score: 2)

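If you want to write a parser in C for a fairly complex syntax without using any existing pattern-matching code, it is usually best to implement a state machine and then process the source code char by char.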

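The output of Flex + Bison is also just a state machine. Flex uses regular expressions to split the input string into tokens, which are then passed to the Bison state machine, which processes one token after another depending on the machine's current state. However, you don't need a regex tokenizer; you can tokenize the input as part of the state machine processing. A regex matcher can itself be implemented as a state machine, so token generation can be folded directly into your state machine.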
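To make the idea of tokenizing as part of a state machine more concrete, here is a minimal sketch of a hand-written tokenizer. It is not taken from Flex or from the link below; the token kinds, the `nextToken` function and the tiny operator set are all assumptions made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for a tiny expression language. */
typedef enum { TOK_NUMBER, TOK_IDENT, TOK_OPERATOR, TOK_END, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind   kind;
    const char *start;   /* points into the source string, not a copy */
    size_t      length;
} Token;

/* States of the tokenizer's own small state machine. */
typedef enum { LEX_Start, LEX_InNumber, LEX_InIdent } LexState;

/* Reads one token starting at *cursor and advances *cursor past it. */
Token nextToken(const char **cursor) {
    LexState    state = LEX_Start;
    const char *p = *cursor;
    Token       tok = { TOK_END, p, 0 };

    for (;;) {
        unsigned char c = (unsigned char)*p;
        switch (state) {
            case LEX_Start:
                if (c == '\0') { tok.start = p; *cursor = p; return tok; }
                if (isspace(c)) { p++; break; }          /* skip whitespace */
                tok.start = p;
                if (isdigit(c)) { state = LEX_InNumber; p++; break; }
                if (isalpha(c) || c == '_') { state = LEX_InIdent; p++; break; }
                /* Single-char operators; anything else is an error token. */
                tok.kind = strchr("+-*/=()", c) ? TOK_OPERATOR : TOK_ERROR;
                tok.length = 1;
                *cursor = p + 1;
                return tok;

            case LEX_InNumber:
                if (isdigit(c)) { p++; break; }
                tok.kind = TOK_NUMBER;                   /* first non-digit ends it */
                tok.length = (size_t)(p - tok.start);
                *cursor = p;
                return tok;

            case LEX_InIdent:
                if (isalnum(c) || c == '_') { p++; break; }
                tok.kind = TOK_IDENT;
                tok.length = (size_t)(p - tok.start);
                *cursor = p;
                return tok;
        }
    }
}
```

A parser would call `nextToken` in a loop until it returns `TOK_END`, switching on its own state for every token in the same way this function switches on its state for every char.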

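Here is an interesting link. It is not specific to C and gives a more general overview of how state machines work, but once you have grasped the concept, it is easy to translate it into C code: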

Parsing command line arguments using a finite state machine and backtracking

Here is sample code for a super primitive CSV parser:

#include <stdlib.h>
#include <stdio.h>

static char currentToken[4096];
static size_t currentTokenLength;

static
void addCharToCurrentToken ( char c ) {
    if (currentTokenLength < sizeof(currentToken)) {
        currentToken[currentTokenLength++] = c;
    }
}

static
void printCurrentToken ( ) {
    printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken);
    currentTokenLength = 0;
}


typedef enum {
    STATE_FindStartOfData,
    STATE_FindStartOfToken,
    STATE_ParseNumber,
    STATE_ParseString,
    STATE_CheckEndOfString,
    STATE_FindDelimiter,
    STATE_ParseError,
    STATE_EndOfData
} ParserState;


ParserState parserState = STATE_FindStartOfData;


static
void runTheStateMachine ( ) {
    while (parserState != STATE_ParseError
            && parserState != STATE_EndOfData
    ) {
        int c = fgetc(stdin);
        // End of data? (fgetc returns EOF, which is not guaranteed to be -1)
        if (c == EOF) {
            switch (parserState) {
                case STATE_ParseNumber:
                case STATE_CheckEndOfString:
                    printCurrentToken();
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseString:
                    // Data ends in the middle of token parsing? No way!
                    fprintf(stderr, "Data ended abruptly!\n");
                    parserState = STATE_ParseError;
                    break;

                case STATE_FindStartOfData:
                case STATE_FindStartOfToken:
                case STATE_FindDelimiter:
                    // This is okay, data stream may end while in these states
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseError:
                case STATE_EndOfData:
                    break;
            }
        }

        switch (parserState) {
                case STATE_FindStartOfData:
                    // Skip blank lines
                    if (c == '\n' || c == '\r') break;
                    // !!!FALLTHROUGH!!!

                case STATE_FindStartOfToken:
                    // Skip over all whitespace
                    if (c == ' ' || c == '\t') break;
                    // Start of string?
                    if (c == '"') {
                        parserState = STATE_ParseString;
                        break;
                    }
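                    // Blank field?
                    if (c == ',') {
                        printCurrentToken();
                        // Leave STATE_FindStartOfData if we got here via fallthrough,
                        // so the newline ending this dataset is not skipped as blank.
                        parserState = STATE_FindStartOfToken;
                        break;
                    }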
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Everything else can only be a number
                    parserState = STATE_ParseNumber;
                    addCharToCurrentToken(c);
                    break;

                case STATE_ParseNumber:
                    if (c == ' ' || c == '\t') {
                        // Numbers cannot contain spaces in the middle,
                        // so this must be the end of the number.
                        printCurrentToken();
                        // We still need to find the real delimiter, though.
                        parserState = STATE_FindDelimiter;
                        break;
                    }
                    if (c == ',') {
                        // This time the number ends directly with a delimiter
                        printCurrentToken();
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        printCurrentToken();
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Otherwise keep reading the number
                    addCharToCurrentToken(c);
                    break;

                case STATE_ParseString:
                    if (c == '"') {
                        // Either this is the regular end of the string or it is just an
                        // escaped quotation mark, which is doubled ("") in CSV.
                        parserState = STATE_CheckEndOfString;
                        break;
                    }
                    // All other chars are just treated as ordinary chars
                    addCharToCurrentToken(c);
                    break;

                case STATE_CheckEndOfString:
                    if (c == '"') {
                        // Next char is also a quotation mark,
                        // so this was not the end of the string.
                        addCharToCurrentToken(c);
                        parserState = STATE_ParseString;
                        break;
                    }
                    if (c == ' ' || c == '\t') {
                        // It was the end of the string
                        printCurrentToken();
                        // We still need to find the real delimiter, though.
                        parserState = STATE_FindDelimiter;
                        break;
                    }
                    if (c == ',') {
                        // It was the end of the string
                        printCurrentToken();
                        // And we even found the delimiter
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    if (c == '\n' || c == '\r') {
                        // It was the end of the string
                        printCurrentToken();
                        // And we even found the end of this dataset
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Everything else is a parse error I guess
                    fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c);
                    parserState = STATE_ParseError;
                    break;

                case STATE_FindDelimiter:
                    // Delimiter found?
                    if (c == ',') {
                        parserState = STATE_FindStartOfToken;
                        break;
                    }
                    // Just skip over all whitespace
                    if (c == ' ' || c == '\t') break;
                    // End of dataset?
                    if (c == '\n' || c == '\r') {
                        // And we even found the end of this dataset
                        printf("------------------------------------------\n");
                        parserState = STATE_FindStartOfData;
                        break;
                    }
                    // Anything else is a parse error I guess
                    fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c);
                    parserState = STATE_ParseError;
                    break;

                case STATE_ParseError:
                    // Nothing to do
                    break;

                case STATE_EndOfData:
                    // Nothing to do
                    break;
        }
    }
}

int main ( ) {
    runTheStateMachine();
    return (parserState == STATE_EndOfData ? 0 : 1);
}

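The code makes the following assumptions:

  • Tokens are never longer than 4096 chars.
  • The delimiter is a comma
    (this is what the name CSV implies, but not all CSV files use a comma for this purpose)
  • Strings are always quoted
    (usually this is optional, unless they contain whitespace or quotation marks)
  • There are no line breaks inside quoted strings
    (usually these are allowed)
  • The code assumes that everything that is not quoted is a number, but it does not verify that the number is well-formed.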
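
As a possible way to close the last gap, a completed unquoted token could be checked with the standard `strtod` function. The helper name `isWellFormedNumber` is made up for this sketch, and it assumes the token has been NUL-terminated first (the parser above does not do that). Note that `strtod` is more permissive than a CSV format may want: it also accepts leading whitespace, hexadecimal input, and `inf`/`nan`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true if `text` consists of exactly one
 * number that strtod accepts, with nothing left over after it.
 * If valueOut is not NULL, the parsed value is stored there. */
bool isWellFormedNumber(const char *text, double *valueOut) {
    char *end = NULL;
    errno = 0;
    double value = strtod(text, &end);

    if (end == text) return false;      /* nothing could be converted */
    if (*end != '\0') return false;     /* trailing garbage, e.g. "12abc" */
    if (errno == ERANGE) return false;  /* value out of range for double */

    if (valueOut != NULL) *valueOut = value;
    return true;
}
```

With this helper, a token such as `3000.00` would be accepted, while `12abc` or an empty token would be rejected and could be reported as a parse error.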

This code definitely cannot parse just any CSV data you throw at it, but when you feed it this file:

"Year","Brand","Model"   ,"Description",  "Price"
    1997,"Ford", "E350","ac, abs, moon", 3000.00
1999,"Chevy","Venture ""Extended Edition""",,4900.00
 1999,"Chevy",     "Venture ""Extended Edition, Very Large"""  ,  , 5000.00
1996,"Jeep", "Grand Cherokee","MUST SELL!"

it will produce this output:

Token: >>>Year<<<
Token: >>>Brand<<<
Token: >>>Model<<<
Token: >>>Description<<<
Token: >>>Price<<<
------------------------------------------
Token: >>>1997<<<
Token: >>>Ford<<<
Token: >>>E350<<<
Token: >>>ac, abs, moon<<<
Token: >>>3000.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition"<<<
Token: >>><<<
Token: >>>4900.00<<<
------------------------------------------
Token: >>>1999<<<
Token: >>>Chevy<<<
Token: >>>Venture "Extended Edition, Very Large"<<<
Token: >>><<<
Token: >>>5000.00<<<
------------------------------------------
Token: >>>1996<<<
Token: >>>Jeep<<<
Token: >>>Grand Cherokee<<<
Token: >>>MUST SELL!<<<
------------------------------------------

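This is only meant to give you an idea of how to parse a complex syntax with a state machine. The code is far from production quality and, as you can see, such a switch grows large very quickly, so I would at least move the code for each state into its own function, or even turn each state into a struct or a similar data-encapsulating object; otherwise it quickly becomes unmanageable.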
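
One way to move the per-state code into functions is a table of handler functions indexed by the state. The following is only a sketch under simplifying assumptions, not a rewrite of the parser above: it handles a single line, treats whitespace as ordinary data, and all names (`Context`, `StateHandler`, `parseLine`, the `ST_` states) are invented for illustration.

```c
#include <stddef.h>

typedef enum { ST_FieldStart, ST_Unquoted, ST_Quoted, ST_QuoteSeen, ST_Error, ST_COUNT } State;

/* All mutable parser data lives here instead of in globals. */
typedef struct {
    char   token[64];
    size_t length;
    size_t fieldCount;
} Context;

/* Every state is a function: it consumes one char and returns the next state. */
typedef State (*StateHandler)(Context *ctx, int c);

static void appendChar(Context *ctx, int c) {
    if (ctx->length < sizeof ctx->token - 1) ctx->token[ctx->length++] = (char)c;
}

static void endField(Context *ctx) {
    ctx->token[ctx->length] = '\0';   /* token stays readable until the next append */
    ctx->fieldCount++;
    ctx->length = 0;
}

static State onFieldStart(Context *ctx, int c) {
    if (c == '"') return ST_Quoted;
    if (c == ',') { endField(ctx); return ST_FieldStart; }   /* blank field */
    appendChar(ctx, c);
    return ST_Unquoted;
}

static State onUnquoted(Context *ctx, int c) {
    if (c == ',') { endField(ctx); return ST_FieldStart; }
    appendChar(ctx, c);
    return ST_Unquoted;
}

static State onQuoted(Context *ctx, int c) {
    if (c == '"') return ST_QuoteSeen;   /* end of string or escaped quote */
    appendChar(ctx, c);
    return ST_Quoted;
}

static State onQuoteSeen(Context *ctx, int c) {
    if (c == '"') { appendChar(ctx, c); return ST_Quoted; }  /* "" is a literal quote */
    if (c == ',') { endField(ctx); return ST_FieldStart; }
    return ST_Error;
}

static State onError(Context *ctx, int c) { (void)ctx; (void)c; return ST_Error; }

static const StateHandler handlers[ST_COUNT] = {
    [ST_FieldStart] = onFieldStart,
    [ST_Unquoted]   = onUnquoted,
    [ST_Quoted]     = onQuoted,
    [ST_QuoteSeen]  = onQuoteSeen,
    [ST_Error]      = onError,
};

/* Parses one line without a trailing newline; returns ST_Error on failure. */
State parseLine(Context *ctx, const char *line) {
    State state = ST_FieldStart;
    ctx->length = 0;
    ctx->fieldCount = 0;
    for (const char *p = line; *p != '\0' && state != ST_Error; p++) {
        state = handlers[state](ctx, (unsigned char)*p);
    }
    if (state == ST_Quoted) return ST_Error;   /* unterminated string */
    if (state != ST_Error) endField(ctx);      /* close the last field */
    return state;
}
```

The main loop shrinks to a single table lookup, and adding a state means adding one enum value, one function and one table entry instead of growing a switch.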