Question

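I am trying to capture names by assuming they follow the form Firstname Lastname. The code below works for that, but I would also like to capture international names such as Pär Åberg. I have found a few solutions, but they do not seem to work with Python-flavored regular expressions. Does anyone have experience with this?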

#!/usr/bin/python
# -*- coding: utf-8 -*- 
import re

text = """
This is a text containing names of people in the text such as 
Hillary Clinton or Barack Obama. My problem is with names that uses stuff 
outside A-Z like Swedish names such as Pär Åberg."""

for name in re.findall("(([A-Z])[\w-]*(\s+[A-Z][\w-]*)+)", text):
    firstname = name[0].split()[0]
    print firstname

Answer 1

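You need an alternative regex library, because there you can use \p{L}, which matches any Unicode letter (\p{Lu} matches any uppercase Unicode letter).

Then, use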

ur'\p{Lu}[\w-]*(?:\s+\p{Lu}[\w-]*)+'

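When the regex is initialized with a Unicode string, the UNICODE flag is used automatically:

"If the ASCII, LOCALE and UNICODE flags are not specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring."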
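
If installing a third-party module is not an option, a similar result can be approximated with the standard library alone. The sketch below is not the \p{Lu} pattern from this answer but a swapped-in technique, and it assumes Python 3, where str patterns are Unicode-aware by default: it matches each word with re and then checks the first letter with str.isupper(). The names WORD and find_names are made up for illustration.

```python
import re

# A word: one letter (any script), then word characters or hyphens.
# [^\W\d_] means "a \w character that is neither a digit nor an underscore".
WORD = re.compile(r"[^\W\d_][\w-]*")


def find_names(text):
    """Return runs of two or more capitalized words separated only by whitespace."""
    names = []
    run = []         # capitalized words collected for the current candidate name
    prev_end = None  # end offset of the previously seen word
    for m in WORD.finditer(text):
        word = m.group()
        # True when only whitespace lies between the previous word and this one.
        adjacent = prev_end is not None and text[prev_end:m.start()].isspace()
        if word[0].isupper() and (not run or adjacent):
            run.append(word)
        else:
            # The run is broken by a lowercase word or by punctuation.
            if len(run) >= 2:
                names.append(" ".join(run))
            run = [word] if word[0].isupper() else []
        prev_end = m.end()
    if len(run) >= 2:
        names.append(" ".join(run))
    return names


text = """
This is a text containing names of people in the text such as
Hillary Clinton or Barack Obama. My problem is with names that uses stuff
outside A-Z like Swedish names such as Pär Åberg."""

for name in find_names(text):
    firstname = name.split()[0]
    print(firstname)
```

Unlike the regex, this treats a sentence boundary as the end of a name, so "Barack Obama. My" yields only "Barack Obama". Whether that is desirable depends on the input text.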
