Question

I have a string like this:

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'

I want to extract this text:

result = 'the unicode text I want with an é'

I tried this code:

expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result)  # just to strip out leading/trailing white space

But whenever é is present in the string s, re.search always returns None.

Note that I also tried various combinations of .* in place of [\sa-zA-Z]+, without success.

Answer 1

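The character ranges a-z and A-Z match only ASCII letters, so é falls outside the class. You can use . instead, which matches any character except a newline, including non-ASCII ones: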

>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print(re.search(r'BEGIN(.+?)END', s).group(1))
 the unicode text I want with an é
>>>

Also note that I simplified your pattern. Here is what it does:

BEGIN  # Matches BEGIN
(.+?)  # Captures one or more characters non-greedily
END    # Matches END
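
To make the difference concrete, here is a small sketch comparing the pattern from the question with the simplified one. It also shows one alternative that is not part of the answer above: keeping a character class but using \w with the re.UNICODE flag (needed on Python 2; Python 3 str patterns are Unicode-aware by default). The variable names are only for illustration.

```python
import re

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'

# Pattern from the question: [\sa-zA-Z] cannot match 'é', so no run of
# class characters reaches from BEGIN all the way to END.
original = re.search(r'(?<=BEGIN)[\sa-zA-Z]+(?=END)', s)

# Simplified pattern from the answer: '.' matches any character except newline.
simplified = re.search(r'BEGIN(.+?)END', s)

# Alternative (an assumption, not from the answer): \w with re.UNICODE
# also accepts non-ASCII letters such as 'é'.
unicode_class = re.search(r'(?<=BEGIN)[\s\w]+(?=END)', s, flags=re.UNICODE)

print(original)                # no match object is returned
print(simplified.group(1))     # captured text, surrounding spaces included
print(unicode_class.group(0))  # same text via the Unicode-aware class
```

Both working variants still return the text with its surrounding spaces.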

Also, you don't need a regex to remove whitespace from the ends of a string. Just use str.strip:

>>> ' a '.strip()
'a'
>>>
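
Putting the two pieces together, a minimal sketch might look like the following. The helper name extract_between and its start/end parameters are hypothetical, chosen only for this example; it also returns None when the markers are missing, rather than failing on .group(1).

```python
import re

def extract_between(text, start=u'BEGIN', end=u'END'):
    """Return the stripped text between start and end, or None if not found."""
    # re.escape guards against markers that contain regex metacharacters.
    pattern = re.escape(start) + u'(.+?)' + re.escape(end)
    match = re.search(pattern, text)
    if match is None:
        return None
    # str.strip removes the leading and trailing whitespace; no regex needed.
    return match.group(1).strip()

s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
result = extract_between(s)
print(result)
```

Note that . does not match newlines by default, so text spanning several lines would need the re.DOTALL flag.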

Extracting a unicode substring using the re module
