Question

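I am trying to reduce a string to the following tokens: single quote, right paren, left paren, integer, whitespace, and ID. An ID is anything that is not one of the others. My tokenizer is not finding the ID.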

import re
import collections

QUOTE = r'(?P<QUOTE>\')'
LPAREN  = r'(?P<LPAREN>\()'
RPAREN  = r'(?P<RPAREN>\))'
INT = r'(?P<INT>\d+)'
WS  = r'(?P<WS>\s+)'
ID = r'(<?P<ID>.*)'

tok_regex = '|'.join((QUOTE, LPAREN, RPAREN, INT, ID, WS))
Token = collections.namedtuple('Token', ['type', 'value'])

def tokenize(text):
    for mo in re.finditer(tok_regex, text):
        kind = mo.lastgroup
        value = mo.group(kind)
        yield Token(kind, value)

tokenstream = tokenize(r'(123 a)')

print(next(tokenstream))
print(next(tokenstream))
print(next(tokenstream))
print(next(tokenstream))
print(next(tokenstream))

This gives me the following output:

Token(type='LPAREN', value='(')
Token(type='INT', value='123')
Token(type='WS', value=' ')
Token(type='RPAREN', value=')')

Why is the ID 'a' not found? ID comes before WS in the combined regex. Is my ID regex incorrect?

Answer 1

ID = r'(?P<ID>[^\d\'\(\)\s]+)'

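This fixes a typo ("<?P" should be "?P", so the group was never actually named ID) and matches everything that is not part of any other token.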
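
For reference, here is a minimal sketch of the question's tokenizer with the corrected ID pattern dropped in. The list name TOKEN_PATTERNS and the precompiled TOK_REGEX are illustrative choices and were not part of the original code. Everything else follows the question.

```python
import collections
import re

Token = collections.namedtuple('Token', ['type', 'value'])

# Patterns from the question, with ID replaced by the corrected version.
# TOKEN_PATTERNS is a name chosen for this sketch only.
TOKEN_PATTERNS = [
    r'(?P<QUOTE>\')',
    r'(?P<LPAREN>\()',
    r'(?P<RPAREN>\))',
    r'(?P<INT>\d+)',
    r'(?P<ID>[^\d\'\(\)\s]+)',  # anything that is not a digit, quote, paren or whitespace
    r'(?P<WS>\s+)',
]
TOK_REGEX = re.compile('|'.join(TOKEN_PATTERNS))


def tokenize(text):
    """Yield a Token for each match of the combined pattern."""
    for mo in TOK_REGEX.finditer(text):
        kind = mo.lastgroup       # name of the alternative that matched
        value = mo.group(kind)
        yield Token(kind, value)


for token in tokenize(r'(123 a)'):
    print(token)

# Prints:
# Token(type='LPAREN', value='(')
# Token(type='INT', value='123')
# Token(type='WS', value=' ')
# Token(type='ID', value='a')
# Token(type='RPAREN', value=')')
```

Note that fixing only the typo and keeping ".*" would probably not give the intended result either: since ID is listed before WS, ".*" would match the space after 123 and then consume the rest of the line, parentheses included. The negated character class avoids this by excluding every character that belongs to another token.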

Regex group not finding generic token
