Question

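I have searched a lot, but I still can't seem to grasp the workflow and the file arrangement of the tokens, the lexer, and the parser. I am using plain C in Visual Studio. I have gone through countless tutorials, and they all seem to skip this part.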

So, when you write a simple function in source.txt, does the lexer live in a C file, read the source file, and break that simple function into tokens?

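In the Dr. Dobb's book, there is a file containing a list of already-defined tokens, written to a def file. If the previous question is true, where does this predefined token list fit in?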

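Once the lex has read the source file and split out the tokens, how does the parser retrieve the tokenized information that the lex puts out, and then apply the grammar?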

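I have seen typedef structs used for what looked like a slight breakdown of the predefined tokens, with some part of the lex inside. So are the tokens in 1 file, the lex in 1 file, and the parser in 1 file? Or are the tokens in 1 file, with the lex and the parser together in another that includes the token file?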

I just need some clarification and a pointer in the right direction. Any help is much appreciated.

Answer 1

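A simple, classic way to organize these tasks is to set up the various parts as subroutines that pass data structures to one another. Forgive my sloppy syntax.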

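First, you need a definition for the tokens produced by the lexer. This is almost always a struct with an enum to indicate which kind of token it is, and a union to carry whatever value that kind of token may have: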

 struct Token {
     enum  // distinguishes token types
          { EndOfFile, Integer, String, Float, Identifier,
            Semicolon, Colon, Plus, Minus, Times, Divide, LeftParen, RightParen,
            KeywordBegin, KeywordDeclare, KeywordEnd, ...
          } tokentype;
     union {
         long numeric_value; // holds numeric value of integer-valued token
         char* string_value; // holds ptr to string body of identifiers or string literals
         float float_value; // holds numeric value of floating-point valued token
     } tokenvalue;
 };

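You will want to build an abstract syntax tree. For that you need a TreeNode type. Like tokens, tree nodes are almost always an enum to indicate which kind of node it is, a union to hold the various properties of that kind of node, and finally a list of pointers to the children: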

      struct TreeNode {
          enum // distinguishes tree node types
             { Goal, Function, StatementList, Statement, LeftHandSide, Expression,
               Add, Subtract, Times, Divide, Identifier, FunctionCall, ...
             } nodetype;  // note: C enumerators share one namespace, so names reused from the token enum need a prefix in real code
          struct TreeNode* children[4];  // a small constant here is almost always enough
          union // hold values specific to node type, often includes a copy of lexer union
             { long numeric_value; // holds:
                       // numeric value of integer-valued token
                       // index of built-in function number
                       // actual number of children if it varies
                       // ...
               char* string_value; // holds ptr to string body of identifiers or string literals
               float float_value; // holds numeric value of floating-point valued token
             } nodevalue;
      };
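To make the node layout concrete, here is a minimal sketch of building and walking such a tree by hand. The helper names (`new_node`, `make_integer`, `make_binary`, `evaluate`, `free_tree`) and the reduced, prefixed `NodeType` enum are assumptions made for this illustration; they are not part of the answer's code. The `evaluate` function merely stands in for a later phase such as a code generator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced set of node kinds; prefixed to avoid clashing with token names. */
enum NodeType { NodeInteger, NodeAdd, NodeTimes };

struct TreeNode {
    enum NodeType nodetype;
    struct TreeNode *children[4];      /* unused slots stay NULL */
    union { long numeric_value; } nodevalue;
};

/* Allocate a node with every child slot cleared. */
static struct TreeNode *new_node(enum NodeType type) {
    struct TreeNode *n = malloc(sizeof *n);
    assert(n != NULL);                 /* sketch only: real code should report the failure */
    n->nodetype = type;
    for (int i = 0; i < 4; i++) n->children[i] = NULL;
    n->nodevalue.numeric_value = 0;
    return n;
}

/* Leaf node carrying an integer literal. */
struct TreeNode *make_integer(long value) {
    struct TreeNode *n = new_node(NodeInteger);
    n->nodevalue.numeric_value = value;
    return n;
}

/* Interior node with two children, e.g. NodeAdd or NodeTimes. */
struct TreeNode *make_binary(enum NodeType op, struct TreeNode *left, struct TreeNode *right) {
    struct TreeNode *n = new_node(op);
    n->children[0] = left;
    n->children[1] = right;
    return n;
}

/* Stand-in for a later compiler phase: walk the tree and compute its value. */
long evaluate(const struct TreeNode *n) {
    switch (n->nodetype) {
        case NodeInteger: return n->nodevalue.numeric_value;
        case NodeAdd:     return evaluate(n->children[0]) + evaluate(n->children[1]);
        case NodeTimes:   return evaluate(n->children[0]) * evaluate(n->children[1]);
    }
    return 0;                          /* not reached for the node kinds above */
}

/* Release a tree bottom-up. */
void free_tree(struct TreeNode *n) {
    if (n == NULL) return;
    for (int i = 0; i < 4; i++) free_tree(n->children[i]);
    free(n);
}
```

For example, `make_binary(NodeAdd, make_integer(1), make_binary(NodeTimes, make_integer(2), make_integer(3)))` builds the tree for `1 + 2 * 3`. A parser would normally make these calls as it recognizes each construct.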

MyCompiler.c contains the following code:

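     int main() {
            filehandle Handle = OpenSourceFile(&FileName);  // filehandle: whatever file type you use
            struct TreeNode* ASTrootnode = Parser(Handle);  // parse the whole source into an AST
            CodeGenerator(ASTrootnode);                     // walk the AST and emit code
            return 0;
     }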

Parser.c contains the following code:

     struct TreeNode* Parser(filehandle Handle) {
            <parsingmachinery>
            struct Token nexttoken = Lexer(Handle);  // ask the lexer for one token whenever the parser needs it
            <moreparsingmachinery to build tree nodes>
            return toplevel_TreeNode;
     }

Lexer.c contains the following code:

    struct Token Lexer(filehandle Handle) {
         struct Token token;
         <process characters in buffer>
         if (bufferp == end_of_buffer)   // buffer exhausted: refill it from the file
            fileread(Handle, &bufferbody, bufferlength);
         <process characters in buffer>
         token.tokentype = <typeofrecognizedtoken>;
         token.tokenvalue.numeric_value = <valueofnumerictoken>;
         ...
         return token;
    }
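As a rough, compilable counterpart to the outline above, here is a minimal sketch of a hand-written lexer. To stay self-contained it reads from an in-memory string instead of a file buffer, and the names `LexerState`, `cursor`, and the `Tok`-prefixed enumerators are assumptions made for this example. The token set is deliberately tiny.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced token set for this sketch. */
enum TokenType { TokEndOfFile, TokInteger, TokIdentifier, TokPlus, TokTimes,
                 TokLeftParen, TokRightParen, TokError };

struct Token {
    enum TokenType tokentype;
    union {
        long  numeric_value;    /* valid when tokentype == TokInteger */
        char *string_value;     /* valid when tokentype == TokIdentifier; caller frees it */
    } tokenvalue;
};

/* Lexer state: a cursor into a NUL-terminated source buffer. */
struct LexerState { const char *cursor; };

/* Return the next token and advance the cursor past it. */
struct Token Lexer(struct LexerState *lx) {
    struct Token token;
    token.tokenvalue.numeric_value = 0;

    /* Skip whitespace between tokens. */
    while (isspace((unsigned char)*lx->cursor)) lx->cursor++;

    unsigned char c = (unsigned char)*lx->cursor;
    if (c == '\0') { token.tokentype = TokEndOfFile; return token; }

    if (isdigit(c)) {                       /* integer literal */
        char *end;
        token.tokentype = TokInteger;
        token.tokenvalue.numeric_value = strtol(lx->cursor, &end, 10);
        lx->cursor = end;
        return token;
    }

    if (isalpha(c) || c == '_') {           /* identifier: copy its text */
        const char *start = lx->cursor;
        while (isalnum((unsigned char)*lx->cursor) || *lx->cursor == '_') lx->cursor++;
        size_t len = (size_t)(lx->cursor - start);
        char *copy = malloc(len + 1);
        assert(copy != NULL);               /* sketch only: real code should report the failure */
        memcpy(copy, start, len);
        copy[len] = '\0';
        token.tokentype = TokIdentifier;
        token.tokenvalue.string_value = copy;
        return token;
    }

    lx->cursor++;                           /* single-character tokens */
    switch (c) {
        case '+': token.tokentype = TokPlus;       break;
        case '*': token.tokentype = TokTimes;      break;
        case '(': token.tokentype = TokLeftParen;  break;
        case ')': token.tokentype = TokRightParen; break;
        default:  token.tokentype = TokError;      break;
    }
    return token;
}
```

Calling `Lexer` repeatedly on a `LexerState` whose cursor points at `"x1 + 42*(y)"` yields an identifier, a plus, an integer, and so on, ending with `TokEndOfFile`. Swapping the string cursor for a refillable file buffer gives the shape sketched in the answer.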

You will obviously want to put the Token and TreeNode declarations into a header file that can be shared among your compiler source files.
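One possible layout for that shared header is sketched below. The file name `compiler.h`, the include-guard macro, the prefixed and shortened enums, and the use of the standard `FILE *` in place of the answer's `filehandle` type are all assumptions made for illustration.

```c
/* compiler.h (hypothetical file name): included by MyCompiler.c, Parser.c and Lexer.c */
#ifndef COMPILER_H
#define COMPILER_H

#include <stdio.h>   /* FILE */

/* Token kinds; prefixed so they cannot clash with node kinds. */
enum TokenType { TokEndOfFile, TokInteger, TokIdentifier, TokPlus, TokTimes };

struct Token {
    enum TokenType tokentype;
    union {
        long  numeric_value;
        char *string_value;
        float float_value;
    } tokenvalue;
};

/* AST node kinds. */
enum NodeType { NodeGoal, NodeExpression, NodeAdd, NodeTimes, NodeIdentifier };

struct TreeNode {
    enum NodeType nodetype;
    struct TreeNode *children[4];
    union {
        long  numeric_value;
        char *string_value;
        float float_value;
    } nodevalue;
};

/* Defined in Lexer.c; the parser calls it each time it needs a token. */
struct Token Lexer(FILE *source);

/* Defined in Parser.c; called once from main() in MyCompiler.c. */
struct TreeNode *Parser(FILE *source);

/* Defined in the code generator's source file. */
void CodeGenerator(struct TreeNode *root);

#endif /* COMPILER_H */
```

With this arrangement each `.c` file starts with `#include "compiler.h"`, so the token list lives in exactly one place and both the lexer and the parser see the same definitions.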

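If you are building a high-performance compiler, you will want to optimize these routines. A trivial example: FileHandle can become a global variable, so it need not be passed between the pieces as an explicit argument. A less trivial example: you will want a high-performance lexer generator, or a hand-coded lexer, to maximize the character-processing rate, especially when skipping blanks and comments.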
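For a sense of what hand-coding the skipping step can look like, here is a small sketch. The function name `skip_blanks_and_comments` is made up for this example, and the comment syntax (`//` to end of line and `/* ... */`) is an assumption, since the answer does not say which language is being compiled. It is a plain character loop, not a tuned implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return a pointer to the first character that is neither whitespace
   nor part of a comment. The input must be NUL-terminated. */
const char *skip_blanks_and_comments(const char *p) {
    assert(p != NULL);
    for (;;) {
        /* Skip a run of whitespace. */
        while (isspace((unsigned char)*p)) p++;

        if (p[0] == '/' && p[1] == '/') {
            /* Line comment: skip to end of line (or end of input). */
            while (*p != '\0' && *p != '\n') p++;
        } else if (p[0] == '/' && p[1] == '*') {
            /* Block comment: skip to the closing delimiter (or end of input). */
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/')) p++;
            if (*p != '\0') p += 2;    /* step over the closing delimiter */
        } else {
            return p;                  /* start of a real token, or end of input */
        }
    }
}
```

A lexer would call this at the top of each token request. Because it runs once per token over every blank and comment character, this loop is typically where tuning the character handling pays off.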

If you want specific details about how to construct a parser that builds an AST, see my SO answer on building recursive descent parsers: https://stackoverflow.com/a/2336769/120163

Answer 2

Yes, your second example and the understanding you describe are pretty typical, with one adjustment.

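Yes, the list of predefined tokens lives in a header or similar (typically in an enum, if allowed). Both the lex and the parser can reference that header file.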

But the tokens that the "lex" actually discovers are passed to the parser at execution time.
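A minimal sketch of that hand-off, assuming a tiny expression grammar chosen only for illustration: the parser keeps one token of lookahead and calls the lexer again each time it consumes a token. To keep the example short it computes a value where a real parser would build tree nodes. `ParserState`, `advance`, and the `parse_*` helpers are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum TokenType { TokEndOfFile, TokInteger, TokPlus, TokTimes,
                 TokLeftParen, TokRightParen, TokError };

struct Token {
    enum TokenType tokentype;
    long numeric_value;              /* valid when tokentype == TokInteger */
};

struct ParserState {
    const char *cursor;              /* lexer position in the source text */
    struct Token current;            /* one-token lookahead held by the parser */
    int failed;                      /* set to 1 on the first syntax error */
};

/* Lexer: produces one token per call, at the moment the parser asks. */
static struct Token Lexer(const char **cursor) {
    struct Token token = { TokError, 0 };
    while (isspace((unsigned char)**cursor)) (*cursor)++;
    unsigned char c = (unsigned char)**cursor;
    if (c == '\0') { token.tokentype = TokEndOfFile; return token; }
    if (isdigit(c)) {
        char *end;
        token.tokentype = TokInteger;
        token.numeric_value = strtol(*cursor, &end, 10);
        *cursor = end;
        return token;
    }
    (*cursor)++;
    if (c == '+') token.tokentype = TokPlus;
    else if (c == '*') token.tokentype = TokTimes;
    else if (c == '(') token.tokentype = TokLeftParen;
    else if (c == ')') token.tokentype = TokRightParen;
    return token;
}

/* Consume the current token by fetching the next one from the lexer. */
static void advance(struct ParserState *ps) { ps->current = Lexer(&ps->cursor); }

static long parse_expression(struct ParserState *ps);

/* factor := Integer | '(' expression ')' */
static long parse_factor(struct ParserState *ps) {
    if (ps->current.tokentype == TokInteger) {
        long v = ps->current.numeric_value;
        advance(ps);
        return v;
    }
    if (ps->current.tokentype == TokLeftParen) {
        advance(ps);
        long v = parse_expression(ps);
        if (ps->current.tokentype == TokRightParen) advance(ps);
        else ps->failed = 1;
        return v;
    }
    ps->failed = 1;
    return 0;
}

/* term := factor ('*' factor)* */
static long parse_term(struct ParserState *ps) {
    long v = parse_factor(ps);
    while (!ps->failed && ps->current.tokentype == TokTimes) {
        advance(ps);
        v *= parse_factor(ps);
    }
    return v;
}

/* expression := term ('+' term)* */
static long parse_expression(struct ParserState *ps) {
    long v = parse_term(ps);
    while (!ps->failed && ps->current.tokentype == TokPlus) {
        advance(ps);
        v += parse_term(ps);
    }
    return v;
}

/* Returns 1 and stores the value on success, 0 on a syntax error. */
int Parser(const char *source, long *result) {
    struct ParserState ps = { source, { TokError, 0 }, 0 };
    assert(source != NULL && result != NULL);
    advance(&ps);                    /* prime the lookahead with the first token */
    long value = parse_expression(&ps);
    if (ps.failed || ps.current.tokentype != TokEndOfFile) return 0;
    *result = value;
    return 1;
}
```

Note that no token list is ever written to a file or stored in an array here: each token exists only between the `Lexer` call that produces it and the `advance` call that replaces it.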

What is the procedural method of arranging the lex and parse files?
