As an educational exercise, I plan to write a Python lexer in Python. Eventually I'd like to implement a simple subset of Python that can run itself, so I want this lexer to be written in a reasonably simple subset of Python, with as few imports as possible.
The tutorials I've found that cover lexing, such as kaleidoscope, look ahead one character to decide what token comes next, but I'm worried this isn't enough for Python (for one thing, with a single character of lookahead you can't tell a delimiter from an operator, or an identifier from a keyword; besides, handling indentation looks like a whole new beast; among other things).
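For the indentation part, my understanding is that it means tracking a stack of indentation levels and emitting INDENT/DEDENT tokens line by line. A rough sketch of that idea (just an illustration, not the real CPython algorithm; the names are made up):

def indent_tokens(lines):
    """ Yield ('INDENT', width) / ('DEDENT', width) events for source lines. """
    stack = [0]                        # open indentation levels, outermost first
    for line in lines:
        if not line.strip():           # blank lines don't change indentation
            continue
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:          # deeper than before: open a new block
            stack.append(width)
            yield ('INDENT', width)
        while width < stack[-1]:       # shallower: close blocks until we match
            stack.pop()
            yield ('DEDENT', width)

src = ["if x:", "    y = 1", "    if y:", "        z = 2", "print(y)"]
for tok in indent_tokens(src):
    print(tok)    # ('INDENT', 4), ('INDENT', 8), then two ('DEDENT', 0)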
I found this link very useful, but when I tried to implement it, my code quickly started to look really ugly, with lots of if statements and special cases, and it didn't feel like the "right" way to do it.
Are there any good resources out there that would help/teach me how to lex this kind of code? (I'd also like to parse it fully, but first things first, right?)
I'm not against using a parser generator, but I want the resulting Python code to use a simple subset of Python and to be reasonably self-contained, so that I can at least dream of having a language that can interpret itself. (For example, from what I understand of this example, if I use ply, I would need my language to interpret the ply package as well as interpret itself, which I imagine would make things more complicated.)
Answer 0 (score: 2)
Take a look at http://pyparsing.wikispaces.com/ - you may find it useful for your task.
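For a sense of what this looks like, here is a minimal sketch of tokenizing a simple expression with pyparsing (assuming pyparsing is installed; the result names such as NUMBER and IDENTIFIER are just illustrative):

from pyparsing import Word, alphas, alphanums, nums, oneOf

identifier = Word(alphas + "_", alphanums + "_")   # letter/underscore, then word chars
number = Word(nums)
operator = oneOf("+ - * / = ( )")

token = number("NUMBER") | identifier("IDENTIFIER") | operator("OP")

# scanString yields (tokens, start, end) for every match found in the input
for match, start, end in token.scanString("erw = _abc + 12*(R4-623902)"):
    print(match[0], start, end)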
Answer 1 (score: 1)
I've used the traditional flex/lex and bison/yacc on similar projects in the past. I've also used ply (Python lex-yacc), and I found the skills transfer from one to the other.
So if you've never written a parser before, I would write your first one using ply; you'll pick up skills that are useful for later projects as well.
Once your ply parser is working, you can write one by hand as an educational exercise. In my experience, hand-writing lexers and parsers gets old very quickly - hence the success of parser generators!
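For reference, a minimal ply lexer for this sort of token set might look roughly like this (a sketch, assuming ply is installed; the token names are just examples):

import ply.lex as lex

# token names must be declared up front
tokens = ('NUMBER', 'IDENTIFIER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE',
          'LPAREN', 'RPAREN', 'EQUALS')

# simple tokens are given as regex strings named t_<TOKEN>
t_PLUS       = r'\+'
t_MINUS      = r'-'
t_TIMES      = r'\*'
t_DIVIDE     = r'/'
t_LPAREN     = r'\('
t_RPAREN     = r'\)'
t_EQUALS     = r'='
t_IDENTIFIER = r'[a-zA-Z_]\w*'
t_ignore     = ' \t'            # characters to skip silently

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)      # convert the lexeme to an int
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('erw = _abc + 12*(R4-623902)')
for tok in lexer:
    print(tok)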
Answer 2 (score: 0)
Take a look at PyPy, a Python implementation of Python. It obviously has a Python parser as well.
Answer 3 (score: 0)
This simple regex-based lexer has served me quite well, several times:
#-------------------------------------------------------------------------------
# lexer.py
#
# A generic regex-based Lexer/tokenizer tool.
# See the if __main__ section in the bottom for an example.
#
# Eli Bendersky (eliben@gmail.com)
# This code is in the public domain
# Last modified: August 2010
#-------------------------------------------------------------------------------
import re
import sys
class Token(object):
    """ A simple Token structure.
        Contains the token type, value and position.
    """
    def __init__(self, type, val, pos):
        self.type = type
        self.val = val
        self.pos = pos

    def __str__(self):
        return '%s(%s) at %s' % (self.type, self.val, self.pos)


class LexerError(Exception):
    """ Lexer error exception.

        pos:
            Position in the input line where the error occurred.
    """
    def __init__(self, pos):
        self.pos = pos


class Lexer(object):
    """ A simple regex-based lexer/tokenizer.
        See below for an example of usage.
    """
    def __init__(self, rules, skip_whitespace=True):
        """ Create a lexer.

            rules:
                A list of rules. Each rule is a `regex, type`
                pair, where `regex` is the regular expression used
                to recognize the token and `type` is the type
                of the token to return when it's recognized.

            skip_whitespace:
                If True, whitespace (\s+) will be skipped and not
                reported by the lexer. Otherwise, you have to
                specify your rules for whitespace, or it will be
                flagged as an error.
        """
        # All the regexes are concatenated into a single one
        # with named groups. Since the group names must be valid
        # Python identifiers, but the token types used by the
        # user are arbitrary strings, we auto-generate the group
        # names and map them to token types.
        idx = 1
        regex_parts = []
        self.group_type = {}

        for regex, type in rules:
            groupname = 'GROUP%s' % idx
            regex_parts.append('(?P<%s>%s)' % (groupname, regex))
            self.group_type[groupname] = type
            idx += 1

        self.regex = re.compile('|'.join(regex_parts))
        self.skip_whitespace = skip_whitespace
        self.re_ws_skip = re.compile(r'\S')

    def input(self, buf):
        """ Initialize the lexer with a buffer as input.
        """
        self.buf = buf
        self.pos = 0

    def token(self):
        """ Return the next token (a Token object) found in the
            input buffer. None is returned if the end of the
            buffer was reached.
            In case of a lexing error (the current chunk of the
            buffer matches no rule), a LexerError is raised with
            the position of the error.
        """
        if self.pos >= len(self.buf):
            return None
        else:
            if self.skip_whitespace:
                m = self.re_ws_skip.search(self.buf, self.pos)
                if m:
                    self.pos = m.start()
                else:
                    return None

            m = self.regex.match(self.buf, self.pos)
            if m:
                groupname = m.lastgroup
                tok_type = self.group_type[groupname]
                tok = Token(tok_type, m.group(groupname), self.pos)
                self.pos = m.end()
                return tok

            # if we're here, no rule matched
            raise LexerError(self.pos)

    def tokens(self):
        """ Returns an iterator to the tokens found in the buffer.
        """
        while 1:
            tok = self.token()
            if tok is None:
                break
            yield tok
if __name__ == '__main__':
    rules = [
        (r'\d+',          'NUMBER'),
        (r'[a-zA-Z_]\w*', 'IDENTIFIER'),   # \w* so single-letter names match too
        (r'\+',           'PLUS'),
        (r'\-',           'MINUS'),
        (r'\*',           'MULTIPLY'),
        (r'\/',           'DIVIDE'),
        (r'\(',           'LP'),
        (r'\)',           'RP'),
        (r'=',            'EQUALS'),
    ]

    lx = Lexer(rules, skip_whitespace=True)
    lx.input('erw = _abc + 12*(R4-623902) ')

    try:
        for tok in lx.tokens():
            print(tok)
    except LexerError as err:
        print('LexerError at position %s' % err.pos)