Ply - catching lexical errors

Asked: 2018-04-15 15:00:51

标签: python compiler-construction scanning ply lexical

I'm having trouble figuring out whether there is a bug in my code or whether Ply simply cannot detect certain lexical errors. For a university course I'm writing a small compiler in Python, using the Ply library for the lexical analyzer. In the scanning part I have the following rules for identifiers and numbers:

# Rule for identifiers
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t

# Regular expression rules for simple tokens
t_NUMBER  = r'\d+'

The number rule is only temporary; I'll add something more specific to handle floats and integers. The problem is that when I test the scanner with an input like 012abc, it returns:

LexToken(NUMBER,'012',1,0) LexToken(ID,'abc',1,3)

Shouldn't that raise an error instead?
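For context, the behaviour can be reproduced without Ply at all: Ply compiles the token rules into one master regular expression and repeatedly takes a match at the current position, so `012abc` splits cleanly into a NUMBER followed by an ID and the error rule never fires. A minimal sketch of that matching loop, using plain `re` (the `tokenize` helper is mine, not part of Ply):

```python
import re

# Simplified version of what ply.lex does internally: one master
# pattern with named groups, matched repeatedly at the current position.
master = re.compile(r'(?P<NUMBER>\d+)|(?P<ID>[a-zA-Z_][a-zA-Z0-9_]*)')

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = master.match(text, pos)
        if not m:
            # This is the only case where an "illegal character" is reported.
            raise SyntaxError("Illegal character %r" % text[pos])
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize('012abc'))  # [('NUMBER', '012'), ('ID', 'abc')]
```

Since both pieces are valid tokens on their own, the error path is only reached for characters that no rule matches at all, which matches the output I'm seeing.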

Here is the complete code:

import ply.lex as lex
import ply.yacc as yacc


# Reserved words
reserved = {
    'if' : 'IF',
    'else' : 'ELSE',
    'while' : 'WHILE',
    'for' : 'FOR',
    'switch' : 'SWITCH',
    'case' : 'CASE',
    'class' : 'CLASS',
    'define' : 'DEFINE',
    'int' : 'INT',
    'float' : 'FLOAT',
    'string' : 'STRING',
    'void' : 'VOID',
    'equal' : 'EQUAL',
    'and' : 'AND',
    'or' : 'OR',
    'not' : 'NOT',
    'do' : 'DO',
}

# The token names are declared here.
tokens = [

    # Literals (identifier, integer constant, float constant, string constant,
    # char const)
    # TODO add constants' Types
    'ID',
    'NUMBER',

    # Operators +,-,*,/,%
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'MOD',

    # Logical Operators
    'LESS_THAN',
    'LESS_EQUAL',
    'GREATER_THAN',
    'GREATER_EQUAL',
    'NOT_EQUAL',

    # Delimiters such as (),{},[],:
    'LPAREN',
    'RPAREN',
    'LBRACE',
    'RBRACE',
    'COLON',

    #Assignment Operators
    'EQUALS'

] 

tokens +=  list(reserved.values())

# Regular expression rules for simple tokens
t_NUMBER  = r'\d+'

# Operators
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_MOD     = r'\%'

# Logical Operators
t_LESS_THAN     = r'<'
t_LESS_EQUAL    = r'<=' 
t_GREATER_THAN  = r'>'
t_GREATER_EQUAL = r'>='
t_NOT_EQUAL     = r'!='

# Delimiters
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_COLON   = r'\:'
t_LBRACE  = r'\{'
t_RBRACE  = r'\}'

# Assignment
t_EQUALS  = r'='


# Rule for identifiers
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# Compute column.
#     input is the input text string
#     token is a token instance
def find_column(input, token):
    line_start = input.rfind('\n', 0, token.lexpos) + 1
    return (token.lexpos - line_start) + 1

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# One line comments Python alike
def t_comment(t):
    r"[ ]*\043[^\n]*"  # \043 is '#'
    pass

# Build the lexer
lexer = lex.lex()

# read input for test purposes
#from read_file_into_buffer import readFileIntoBuffer    
#data = readFileIntoBuffer('test.fpl')
data = input()

# feed lexer with input
lexer.input(data)

# TODO - Create a function to open a file from stdinput
# The file shall be passed as argument


# Tokenize
while True:
    tok = lexer.token()
    if not tok: 
        break      # No more input
    print(tok)

0 Answers