Question

I have read the documentation and written hundreds of regular expressions, but I don't know how to detect a sequence of Unicode letters.

# this detects a sequence of English letters
re.compile(r'[a-zA-Z]+')
# this detects a sequence of Unicode letters plus [0-9_]
re.compile(r'\w+', re.UNICODE)
# how do I detect a sequence of Unicode letters only (without [0-9_])?
re.compile(r'????', re.UNICODE)

How do I match only Unicode letters, without [0-9_]?

I tested your solutions:

import re
import timeit

def test1():
  regex = re.compile(ur'(?:(?![\d_])\w)+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

def test2():
  regex = re.compile(ur'[^\W\d_]+', re.UNICODE)
  return regex.findall(u'Ala ma kota z czarną sierścią - 1halo - halo1.')

print test1()
print test2()

print timeit.timeit(test1)
print timeit.timeit(test2)

The results and timings are:

[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
[u'Ala', u'ma', u'kota', u'z', u'czarn\u0105', u'sier\u015bci\u0105', u'halo', u'halo']
11.0143377108
7.42619199741

Answer 1

You can combine a negative lookahead with \w to match "word" characters other than digits and the underscore:

re.compile(r"(?:(?![\d_])\w)+", re.UNICODE)
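A minimal sketch of how this pattern behaves, assuming Python 3, where str patterns are Unicode-aware by default and the re.UNICODE flag is redundant. The sample string is borrowed from the question; the names LETTERS and sample are illustrative only.

```python
import re

# (?![\d_]) refuses to proceed if the next character is a digit or underscore;
# \w then consumes one of the remaining "word" characters, i.e. a letter.
LETTERS = re.compile(r"(?:(?![\d_])\w)+")

sample = "Ala ma kota z czarną sierścią - 1halo - halo1."
words = LETTERS.findall(sample)
print(words)
```

The lookahead runs once per character, which may explain why this variant came out slower than the character-class version in the question's timings.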

Answer 2

Use Unicode strings and a source-encoding declaration, then look for the characters you specified in your comments. Python 2.7 has no shortcut for "Unicode alpha characters":

# coding: utf8
import re
expr = re.compile(ur'(?u)[^\W\d_]+')
s = u'The quick brown fóx jumped over Łhe laży dog 17 times.'
for i in expr.finditer(s):
    print i.group(0)

Output:

The
quick
brown
fóx
jumped
over
Łhe
laży
dog
times

If you want everything that Unicode considers an uppercase or lowercase letter, see this answer.
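For readers on Python 3, the same character class should work without the u prefix or the (?u) flag, since str patterns match Unicode by default there. The following is a sketch under that assumption, reusing the sentence from the Python 2.7 example above; collecting the matches into a list is just for illustration.

```python
import re

# [^\W\d_] reads as: not a non-word character, not a digit, not an underscore,
# which leaves only the letters that \w would otherwise match.
expr = re.compile(r"[^\W\d_]+")

s = "The quick brown fóx jumped over Łhe laży dog 17 times."
words = [m.group(0) for m in expr.finditer(s)]
print(words)
```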

Answer 3

Try this; it matches any Unicode character that is not a digit:

re.compile(r'\D')
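A quick check, assuming Python 3, suggests that \D is broader than what the question asks for: it accepts whitespace, punctuation and underscores along with letters. The sample string below is made up for the comparison.

```python
import re

sample = "fóx 17, Łhe_dog!"

# \D matches every character that is not a decimal digit,
# so spaces, punctuation and underscores are included.
non_digits = re.findall(r"\D+", sample)

# For comparison, the character class from the other answers keeps letters only.
letters = re.findall(r"[^\W\d_]+", sample)

print(non_digits)
print(letters)
```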

How do I write a regular expression that matches all Unicode characters in Python?