Question

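I am trying to tokenize a string such as hello world123 into hello, world and 123. I think the solution has two parts, shown below, but I cannot combine them correctly to tokenize the string.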

(?u)\b\w+\b
(?<=\D)(?=\d)|(?<=\d)(?=\D)

Answer 1

You can use

import re
s = "hello world123"
print(re.findall(r'[^\W\d_]+|\d+', s))
# => ['hello', 'world', '123']

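Pattern details

[^\W\d_]+ matches one or more letters (any word character that is neither a digit nor an underscore)
| means "or"
\d+ matches one or more digits

See the regex demo.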
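To relate this to the two patterns in the question, here is a minimal sketch that compares the single-pattern approach with a two-step one: extract words first, then split each word at letter/digit boundaries. The function names tokenize and tokenize_two_step are hypothetical, and the two-step version assumes Python 3.7 or later, where re.split can split on zero-width matches.

```python
import re

# Single pattern from the answer: runs of letters, or runs of digits.
LETTERS_OR_DIGITS = re.compile(r'[^\W\d_]+|\d+')

# The two patterns from the question. The (?u) flag is dropped because
# str patterns are Unicode-aware by default in Python 3.
WORD = re.compile(r'\b\w+\b')
DIGIT_BOUNDARY = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')


def tokenize(text):
    """Hypothetical helper: one pass with the single pattern."""
    return LETTERS_OR_DIGITS.findall(text)


def tokenize_two_step(text):
    """Hypothetical helper: find words, then split each at digit boundaries.

    Assumes Python 3.7+, where re.split accepts zero-width matches.
    """
    tokens = []
    for word in WORD.findall(text):
        tokens.extend(DIGIT_BOUNDARY.split(word))
    return tokens


print(tokenize("hello world123"))
# => ['hello', 'world', '123']
print(tokenize_two_step("hello world123"))
# => ['hello', 'world', '123']

# The two approaches differ on underscores: \w includes "_",
# while [^\W\d_] excludes it.
print(tokenize("foo_bar42"))
# => ['foo', 'bar', '42']
print(tokenize_two_step("foo_bar42"))
# => ['foo_bar', '42']
```

The single pattern is simpler and also drops underscores, which may or may not be what you want.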

Bonus: to match any letter substrings and numbers in various formats (signed, decimal, or with an exponent), use

[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?

For details on this regex, see Parsing scientific notation sensibly?.
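As a quick illustration of the bonus pattern, the sketch below runs it on a made-up input string. The name LETTERS_OR_NUMBERS and the sample text are assumptions for demonstration only.

```python
import re

# Bonus pattern from the answer: runs of letters, or an optionally signed
# number with an optional fractional part and an optional exponent.
LETTERS_OR_NUMBERS = re.compile(
    r'[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?'
)

sample = "temp -3.5e-2 and x10"  # made-up input for illustration
print(LETTERS_OR_NUMBERS.findall(sample))
# => ['temp', '-3.5e-2', 'and', 'x', '10']
```

Note that a sign directly before a digit is always treated as part of the number, so "x-1" yields ['x', '-1'] rather than three tokens.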