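I want to parse a file into a list of tokens. Each token spans at least one line but may span more. Each token matches a regular expression. I want to signal an error if the input is not a pure sequence of tokens (i.e. there must be no leading, in-between or trailing garbage). I am not concerned about memory consumption, as the input files are relatively small.
In Perl, I would use something like this (pseudocode):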
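$s = slurp_file ();
while ($s ne '') {
    # Strip one token from the front of $s; /p keeps the matched text in ${^MATCH}.
    if ($s =~ s/^\nsection (\d)\n\n//p) {
        push (@r, ['SECTION ' . $1, ${^MATCH}]);
    } elsif ($s =~ s/^some line\n//p) {
        push (@r, ['SOME LINE', ${^MATCH}]);
    [...]
    } else {
        # Nothing matched at the current position, so the rest is garbage.
        die ("Found garbage: " . Dumper ($s));
    }
}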
I could of course port this 1:1 to Python, but is there a more pythonic way? (I do not want to parse line by line and then build a hand-crafted state machine on top.)
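For reference, a direct port of the loop above might look roughly like the sketch below. It is only an illustration: the `TOKEN_PATTERNS` table and the `tokenize` helper are made-up names, the rules elided by `[...]` are left out, and the position index `pos` replaces Perl's destructive `s///` on the string.

```python
import re

# (label template, compiled pattern) pairs, tried in order.
# Hypothetical table mirroring the two rules in the Perl pseudocode.
TOKEN_PATTERNS = [
    ("SECTION {0}", re.compile(r"\nsection (\d)\n\n")),
    ("SOME LINE", re.compile(r"some line\n")),
]


def tokenize(text):
    """Return a list of (label, matched_text) pairs; raise ValueError on garbage."""
    tokens = []
    pos = 0
    while pos < len(text):
        for label, pattern in TOKEN_PATTERNS:
            # match() anchors at pos, like ^ on the shrinking Perl string.
            m = pattern.match(text, pos)
            if m:
                tokens.append((label.format(*m.groups()), m.group(0)))
                pos = m.end()
                break
        else:
            # No pattern matched at pos: the remaining input is garbage.
            raise ValueError("Found garbage: {!r}".format(text[pos:]))
    return tokens
```

Calling `tokenize("\nsection 1\n\nsome line\n")` returns two tokens, and any input that does not start with a known token at the current position raises `ValueError`.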
Answer 0 (score: 2)
There is an undocumented tool in the re module, re.Scanner, which may be helpful to you. You could use it like this:
import re
import sys

def section(scanner, token):
    # scanner.match is the current match object; group(1) is the section number.
    return "SECTION", scanner.match.group(1)

def some_line(scanner, token):
    return "SOME LINE", token

def garbage(scanner, token):
    sys.exit('Found garbage: {}'.format(token))

# scanner will attempt to match these patterns in the order listed.
# If there is a match, the second argument is called.
scanner = re.Scanner([
    (r"section (\d+)$", section),
    (r"some line$", some_line),
    (r"\s+", None),       # skip whitespace
    (r".+", garbage),     # if you get here it's garbage
    ], flags=re.MULTILINE)

tokens, remainder = scanner.scan('''\
section 1
some line
''')
for token in tokens:
    print(token)
yields
('SECTION', '1')
('SOME LINE', 'some line')
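Because re.Scanner is undocumented, a similar result can be obtained with documented API only: join the patterns into one regex with named groups and dispatch on `match.lastgroup`. The following is a minimal sketch assuming the same two token types; `TOKEN_SPEC`, `MASTER`, `tokenize` and the group names are made up for illustration.

```python
import re

# Hypothetical token specification, tried in the order listed.
TOKEN_SPEC = [
    ("SECTION", r"section (?P<number>\d+)$"),
    ("SOME_LINE", r"some line$"),
    ("SKIP", r"\s+"),      # whitespace between tokens is ignored
    ("GARBAGE", r".+"),    # anything else on a line is an error
]
MASTER = re.compile(
    "|".join("(?P<{}>{})".format(name, pattern) for name, pattern in TOKEN_SPEC),
    re.MULTILINE,
)


def tokenize(text):
    """Return a list of (kind, value) pairs; raise ValueError on garbage."""
    tokens = []
    # SKIP and GARBAGE together match any character, so finditer leaves no gaps.
    for m in MASTER.finditer(text):
        kind = m.lastgroup  # name of the outermost alternative that matched
        if kind == "SKIP":
            continue
        if kind == "GARBAGE":
            raise ValueError("Found garbage: {!r}".format(m.group()))
        if kind == "SECTION":
            tokens.append(("SECTION", m.group("number")))
        else:
            tokens.append(("SOME LINE", m.group()))
    return tokens
```

Raising an exception instead of calling `sys.exit` leaves the decision about how to report the error to the caller.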