Question

I have to translate the lexical analyzer code from Sebesta's Concepts of Programming Languages (chapter 4, section 2) into Python. This is what I have so far:

# Character classes #
LETTER = 0
DIGIT = 1
UNKNOWN = 99

# Token Codes #
INT_LIT = 10
IDENT = 11
ASSIGN_OP = 20
ADD_OP= 21
SUB_OP = 22
MULT_OP = 23
DIV_OP = 24
LEFT_PAREN = 25
RIGHT_PAREN = 26

charClass = ''
lexeme = ''
lexLen = 0
token = ''
nextToken = ''

### lookup - function to lookup operators and parentheses ###
###          and return the token                         ###
def lookup(ch):
    def left_paren():
        addChar()
        globals()['nextToken'] = LEFT_PAREN

    def right_paren():
        addChar()
        globals()['nextToken'] = RIGHT_PAREN

    def add():
        addChar()
        globals()['nextToken'] = ADD_OP

    def subtract():
        addChar()
        globals()['nextToken'] = SUB_OP

    def multiply():
        addChar()
        globals()['nextToken'] = MULT_OP

    def divide():
        addChar()
        globals()['nextToken'] = DIV_OP
    options = {')': right_paren, '(': left_paren, '+': add,
               '-': subtract, '*': multiply , '/': divide}

    if ch in options.keys():
        options[ch]()
    else:
        addChar()

### addchar- a function to add next char to lexeme ###
def addChar():
    #lexeme = globals()['lexeme']
    if(len(globals()['lexeme']) <=98):
        globals()['lexeme'] += nextChar
    else:
        print("Error. Lexeme is too long")

### getChar- a function to get the next Character of input and determine its character class ###
def getChar():
    globals()['nextChar'] = globals()['contents'][0]
    if nextChar.isalpha():
        globals()['charClass'] = LETTER
    elif nextChar.isdigit():
        globals()['charClass'] = DIGIT
    else:
        globals()['charClass'] = UNKNOWN
    globals()['contents'] = globals()['contents'][1:]


## getNonBlank() - function to call getChar() until it returns a non whitespace character ##
def getNonBlank():
    while nextChar.isspace():
        getChar()

## lex- simple lexical analyzer for arithmetic functions ##
def lex():
    globals()['lexLen'] = 0
    getNonBlank()
    def letterfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == LETTER or globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = IDENT

    def digitfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = INT_LIT

    def unknownfunc():
        globals()['lexeme'] = ''
        lookup(nextChar)
        getChar()

    lexDict = {LETTER: letterfunc, DIGIT: digitfunc, UNKNOWN: unknownfunc}
    if charClass in lexDict.keys():
        lexDict[charClass]()
    print('The next token is: '+ str(globals()['nextToken']) + ' The next lexeme is: ' + globals()['lexeme'])

with open('input.txt') as input:
    contents = input.read()
    getChar()
    lex()
    while contents:
        lex()

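I am using the string sum + 1 / 33 as my sample input. As far as I understand, the first call to getChar() at the top level sets charClass to LETTER and contents to um + 1 / 33.

The program then enters the while loop and calls lex(). lex() in turn accumulates the word sum into lexeme. When the while loop inside letterfunc hits the first whitespace character, it breaks and lex() exits.

Since contents is not empty, the program goes through the top-level while loop again. This time, the getNonBlank() call inside lex() discards the space in contents, and the same process repeats.

Where I run into the error is at the last lexeme. I am told that globals()['contents'][0] is out of range when it is evaluated in getChar(). I don't expect this to be a hard error to find, but I have tried tracing it by hand and can't seem to spot the problem. Any help would be greatly appreciated.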

Answer 1

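This happens simply because, after successfully reading the final 3 of the input string, digitfunc calls getChar once more. At that point contents is exhausted and empty, so contents[0] indexes past the end of the buffer, hence the error.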
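The failure mode can be reproduced in isolation. The sketch below is not part of the original program; the names `next_char` and `error` are made up for illustration. It shows that slicing an exhausted string is harmless while indexing it raises IndexError:

```python
contents = "3"            # one character left to read

next_char = contents[0]   # fine: '3'
contents = contents[1:]   # slicing never fails; contents is now ''

try:
    contents[0]           # indexing an empty string raises IndexError
except IndexError as exc:
    error = str(exc)

print(repr(next_char), repr(contents), error)
```

This is the situation getChar() is in when digitfunc calls it after the last digit.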

As a workaround, if you add a newline or a space after the last character of the expression, your current code runs without the error.

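The reason is that when the last character is UNKNOWN, you return from lex immediately and leave the loop because contents is empty, but when you are processing a number or an identifier, you keep calling getChar in a loop without testing for end of input. By the way, if your input string ends with a right paren, your lexer swallows it and forgets to report that it found it.

So you should at least: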

Test for end of input in getChar:
```
def getChar():
    if len(contents) == 0:
        # print "END OF INPUT DETECTED"
        globals()['charClass'] = UNKNOWN
        globals()['nextChar'] = ''
        return
    ...
```

Display the last token:
```
...
while contents:
    lex()
lex()
```

Check whether a lexeme is present (odd things can happen at end of input):
```
...
if charClass in lexDict.keys():
    lexDict[charClass]()
if lexeme != '':
    print('The next token is: '+ str(globals()['nextToken']) +
          ' The next lexeme is: >' + globals()['lexeme'] + '<')
```

But your use of globals() is bad. The usual idiom for using a global inside a function is to declare it with the `global` statement before use.
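For reference, the `global` declaration idiom could look roughly like this. It is a minimal sketch that reuses the names `add`, `addChar`, `nextToken`, `lexeme` and `nextChar` from the question; the initial values are assumptions chosen only to make the snippet runnable on its own:

```python
ADD_OP = 21
lexeme = ''
nextChar = '+'      # assumed current input character, for illustration only
nextToken = None

def addChar():
    global lexeme       # declared because the function rebinds the module-level name
    lexeme += nextChar  # nextChar is only read, so it needs no declaration

def add():
    global nextToken
    addChar()
    nextToken = ADD_OP

add()
print(nextToken, lexeme)
```

Only names that a function assigns to need the declaration; names that are merely read resolve to the module level automatically.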

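But globals in Python are a code smell. The best thing you can do is to properly encapsulate your parser in a class. Objects are better than globals.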
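One possible shape for such a class is sketched below. It is not the book's code or a drop-in replacement: the class name `Lexer`, the methods `_peek`, `next_token` and `tokens`, and the `EOF` code are assumptions introduced here, and unlike the question's lookup it reports unrecognised characters as UNKNOWN. The token codes are copied from the question.

```python
# Token codes copied from the question; EOF is an added, hypothetical code.
INT_LIT, IDENT, UNKNOWN, EOF = 10, 11, 99, -1
OPERATORS = {'+': 21, '-': 22, '*': 23, '/': 24, '(': 25, ')': 26}


class Lexer:
    """Hypothetical class-based lexer: all state lives on the instance."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _peek(self):
        # Current character, or '' at end of input (never raises IndexError).
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def next_token(self):
        # Skip whitespace; ''.isspace() is False, so this stops at end of input.
        while self._peek().isspace():
            self.pos += 1
        ch = self._peek()
        if ch == '':
            return EOF, ''
        start = self.pos
        if ch.isalpha():
            # Identifier: a letter followed by letters or digits.
            while self._peek().isalpha() or self._peek().isdigit():
                self.pos += 1
            return IDENT, self.text[start:self.pos]
        if ch.isdigit():
            # Integer literal: a run of digits.
            while self._peek().isdigit():
                self.pos += 1
            return INT_LIT, self.text[start:self.pos]
        # Single-character operator or parenthesis, else UNKNOWN.
        self.pos += 1
        return OPERATORS.get(ch, UNKNOWN), ch

    def tokens(self):
        # Yield (token, lexeme) pairs until end of input.
        while True:
            token, lexeme = self.next_token()
            if token == EOF:
                return
            yield token, lexeme


for token, lexeme in Lexer("sum + 1 / 33").tokens():
    print('The next token is:', token, 'The next lexeme is:', lexeme)
```

Because end of input is handled in one place (`_peek`), the last lexeme needs no special case, and an input ending in a right paren is reported like any other token.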

Writing Sebesta's lexical analyzer in Python. It does not work for the last lexeme in the input file
