Question

When I use a regular expression in Python:

import re

WORD = re.compile(r'\w+')

and then run:

w = 'This is a test'
WORD.findall(w)

I get:

['This', 'is', 'a', 'test']

Now I want the half-space character \u200c (zero-width non-joiner) to be treated as a normal alphanumeric character, so that if I have:

w = 'This\u200cis a test'

then when I run WORD.findall(w), I get:

['This\u200cis', 'a', 'test']

How can I do that?

Answer 1

Use a character class that contains \u200c in addition to \w (Python 3.x; the re module has understood \u escapes inside patterns since 3.3):

>>> import re
>>> re.findall(r'[\u200c\w]+', 'This\u200cis a test')
['This\u200cis', 'a', 'test']

In Python 2.x, you need to use unicode strings, so that the string literal itself turns \u200c into the character:

>>> re.findall(u'[\u200c\w]+', u'This\u200cis a test')
[u'This\u200cis', u'a', u'test']
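To connect this back to the compiled WORD pattern from the question, here is a minimal sketch for Python 3.3 or later. The helper name find_words is made up for illustration and is not part of the original code; the only real change is the character class.

```python
import re

# \w plus U+200C (zero-width non-joiner, the "half-space" in the question).
# In a raw string, the re module itself interprets \u200c (Python 3.3+).
WORD = re.compile(r'[\w\u200c]+')


def find_words(text):
    """Hypothetical helper: return all runs of word characters or U+200C."""
    return WORD.findall(text)


if __name__ == '__main__':
    # The half-space no longer splits 'This' and 'is' into two matches.
    print(find_words('This\u200cis a test'))
    # Ordinary spaces still separate words as before.
    print(find_words('This is a test'))
```

Because the half-space sits inside the same character class as \w, it extends a match instead of ending it, while every other non-word character still acts as a separator.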