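I'm trying to parse user input in which every word/name/number is separated by whitespace (except for strings, which are delimited by double quotes) and pushed onto a list. The list gets printed along the way. I wrote a version of this code before, but this time I want to use tokens to make things cleaner. Here is what I have so far, but it doesn't print anything.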
#!/util/bin/python
import re

def main ():
    for i in tokenizer('abcd xvc 23432 "exampe" 366'):
        print (i);

tokens = (
    ('STRING', re.compile('"[^"]+"')), # longest match
    ('NAME', re.compile('[a-zA-Z_]+')),
    ('SPACE', re.compile('\s+')),
    ('NUMBER', re.compile('\d+')),
)

def tokenizer(s):
    i = 0
    lexeme = []
    while i < len(s):
        match = False
        for token, regex in tokens:
            result = regex.match(s, i)
            if result:
                lexeme.append((token, result.group(0)))
                i = result.end()
                match = True
                break
        if not match:
            raise Exception('lexical error at {0}'.format(i))
    return lexeme

main()
Answer 0 (score: 2)
I suggest using the shlex module to split up the quoted strings:
>>> import shlex
>>> s = 'hello "quoted string" 123 \'More quoted string\' end'
>>> s
'hello "quoted string" 123 \'More quoted string\' end'
>>> shlex.split(s)
['hello', 'quoted string', '123', 'More quoted string', 'end']
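After that, you can classify the tokens (strings, numbers, ...) however you need. The only thing you lose is the whitespace: shlex discards it.
Here is a simple demo: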
import shlex

if __name__ == '__main__':
    line = 'abcd xvc 23432 "exampe" 366'
    tokens = shlex.split(line)
    for token in tokens:
        # Parenthesized print works on both Python 2 and Python 3.
        print('>{}<'.format(token))
Output:
>abcd<
>xvc<
>23432<
>exampe<
>366<
If you insist on not stripping the quotes, call split() with posix=False:
tokens = shlex.split(line, posix=False)
Output:
>abcd<
>xvc<
>23432<
>"exampe"<
>366<
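The classification step mentioned above could look something like the sketch below. It assumes Python 3 and posix=False, so that quoted strings keep their quotes and can be told apart from names. The classify helper and the category labels are hypothetical; shlex itself only splits and does not classify anything.

```python
import re
import shlex

# Hypothetical categories; these patterns mirror the ones in the question.
NUMBER_RE = re.compile(r'\d+')
NAME_RE = re.compile(r'[a-zA-Z_]+')


def classify(token):
    """Return a (kind, text) pair for one token produced by shlex.split."""
    # With posix=False, quoted strings still carry their surrounding quotes.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in '"\'':
        return ('STRING', token)
    if NUMBER_RE.fullmatch(token):
        return ('NUMBER', token)
    if NAME_RE.fullmatch(token):
        return ('NAME', token)
    # Anything else (punctuation, mixed text) is left unclassified.
    return ('UNKNOWN', token)


line = 'abcd xvc 23432 "exampe" 366'
classified = [classify(t) for t in shlex.split(line, posix=False)]
for item in classified:
    print(item)
```

Tokens that match none of the patterns are labeled UNKNOWN here rather than raising an error; whether that is the right behavior depends on how strict the parser needs to be.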
Answer 1 (score: 1)
I think your indentation is broken. This:
#!/util/bin/python
import re

# Patterns are tried in this order; raw strings avoid invalid-escape warnings.
tokens = (
    ('STRING', re.compile(r'"[^"]+"')), # longest match
    ('NAME', re.compile(r'[a-zA-Z_]+')),
    ('SPACE', re.compile(r'\s+')),
    ('NUMBER', re.compile(r'\d+')),
)

def main():
    for i in tokenizer('abcd xvc 23432 "exampe" 366'):
        print(i)

def tokenizer(s):
    i = 0
    lexeme = []
    while i < len(s):
        match = False
        for token, regex in tokens:
            # Try to match this pattern starting at position i.
            result = regex.match(s, i)
            if result:
                lexeme.append((token, result.group(0)))
                i = result.end()
                match = True
                break
        # Checked once per position, after every pattern has been tried.
        if not match:
            raise Exception('lexical error at {0}'.format(i))
    # Returned only after the whole input has been consumed.
    return lexeme

main()
prints:
('NAME', 'abcd')
('SPACE', ' ')
('NAME', 'xvc')
('SPACE', ' ')
('NUMBER', '23432')
('SPACE', ' ')
('STRING', '"exampe"')
('SPACE', ' ')
('NUMBER', '366')
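If the SPACE tokens are not wanted in the final list, as the question's description suggests, they can be filtered out after tokenizing. This is a minimal sketch that repeats the token table and tokenizer from above so it runs on its own; the significant_tokens name is made up for illustration.

```python
import re

tokens = (
    ('STRING', re.compile(r'"[^"]+"')),
    ('NAME', re.compile(r'[a-zA-Z_]+')),
    ('SPACE', re.compile(r'\s+')),
    ('NUMBER', re.compile(r'\d+')),
)


def tokenizer(s):
    """Split s into (kind, text) pairs, raising on unrecognized input."""
    i = 0
    lexeme = []
    while i < len(s):
        match = False
        for token, regex in tokens:
            result = regex.match(s, i)
            if result:
                lexeme.append((token, result.group(0)))
                i = result.end()
                match = True
                break
        if not match:
            raise Exception('lexical error at {0}'.format(i))
    return lexeme


def significant_tokens(s):
    """Hypothetical helper: same as tokenizer, minus the SPACE tokens."""
    return [(kind, text) for kind, text in tokenizer(s) if kind != 'SPACE']


result = significant_tokens('abcd xvc 23432 "exampe" 366')
for item in result:
    print(item)
```

Filtering afterwards keeps the tokenizer itself unchanged, so the whitespace is still available to any caller that needs it.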