Regex for parsing names whose last name has a prefix

Time: 2014-01-20 20:17:25

Tags: c# regex parsing

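I'm learning regular expressions (C#) while working in RegexBuddy (love it). I've been trying to parse names that follow a very specific pattern. I know this can't be perfect, but I think I'm very close to what I want to accomplish.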

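Assumptions

  1. The name pattern is FIRST [MIDDLE] LAST, all uppercase, where MIDDLE is optional, with no titles or suffixes
  2. I want to capture FIRST and MIDDLE into a firstname value, and LAST into a lastname value
  3. FIRST and MIDDLE can consist of any number of words
  4. I know I can't match multi-word last names (I'm fine with that) except in 2 cases:
    • Last names with a hyphen
    • Names whose last name has a prefix ("EL GHAMRY SABE", "DE AMORIM SILVA", "DE LA HOYA" are actual examples from my data)
  5. This is my regex so far (using a few last-name prefixes):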

    ^(?<first>[ A-Z]+?) (?<last>(?<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$
    

    This works well for the following (capturing the first name, the last name, and the last-name prefix):

    JOHN SMITH
    JOHN JAY SMITH
    JOHN JAYEL SMITH
    JOHN JAY SMITH-JONES
    JOHN JAY JIMMY SMITH JONES  -- only "JONES" is in the last name, which is okay for this exercise
    JOHN JAY EL AMIN
    JOHN JAY DE LA HOYA  -- "DE LA HOYA" is the last name
    JOHN JAY EL  -- a case where "EL" is actually the last name
    JOHN EL AMIN
    

    But these two, which have a multi-part last name after the last-name prefix, fail (only the last word is captured in the lastname field):

    JOHN JAY EL GHAMRY SABE
    CICERO JOSE TORRES DE AMORIM SILVA
    

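    SO ... 2 questions

    1. How can I change my expression so that IF there is a last-name prefix ("EL", "DE", "LE", "DE LA", etc.), the prefix and everything after it go into the lastname field, and if there is no prefix, only the last word goes into the lastname field?
    2. Since I'm still learning, can you suggest any other improvements to my regex?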
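To make the failure concrete, here is a minimal sketch in Python (not C#; Python is used because the benchmark further down is also written in it). The pattern is the one from the question, with only the named-group syntax changed from `(?<name>...)` to Python's `(?P<name>...)`. The helper name `split_name` is hypothetical and exists only for this illustration.

```python
import re

# Pattern from the question; only the named-group syntax is adapted
# from .NET's (?<name>...) to Python's (?P<name>...).
QUESTION_RE = re.compile(
    r"^(?P<first>[ A-Z]+?) (?P<last>(?P<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$"
)

def split_name(name):
    """Return (first, last), or None if the name does not match."""
    m = QUESTION_RE.match(name)
    return (m.group("first"), m.group("last")) if m else None

print(split_name("JOHN JAY DE LA HOYA"))      # prefix followed by one word
print(split_name("JOHN JAY EL GHAMRY SABE"))  # prefix followed by two words
```

The first call returns `('JOHN JAY', 'DE LA HOYA')`. The second returns `('JOHN JAY EL GHAMRY', 'SABE')`: after the prefix group, `[A-Z\-]+?` cannot cross a space, so the lazy first-name group keeps growing until a single word is left for the last name.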

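2 Answers:

Answer 0: (score: 1)

Original answer

You can use a negative lookahead to assert that the next "middle name" you match is not one of the last-name prefixes: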

^(?<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$

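Please also note that I changed the explicit use of spaces to \s. This is considered good practice, because most RegEx engines can be set to ignore whitespace in the pattern entirely, which lets you format the expression more readably. In my RegEx I stuck to the syntax your program uses, even though (?P<group_name>) might be more common.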
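As a sketch of what the whitespace-ignoring mode buys you, here is the same expression laid out over several lines with Python's `re.VERBOSE` flag (the .NET counterpart is `RegexOptions.IgnorePatternWhitespace`). The group names are written in Python's `(?P<name>...)` form, and the inline comments are my reading of each part, not something taken from the original post.

```python
import re

# Same expression as above, spread over several lines. This only works
# because the pattern uses \s instead of literal spaces: in verbose mode,
# unescaped whitespace inside the pattern is ignored.
NAME_RE = re.compile(
    r"""
    ^
    (?P<first>
        [A-Z]+                        # first word of the first name
        (?: \s+
            (?!(?:EL|DE|LA)\s)        # stop at a prefix followed by whitespace
            [A-Z]+
        )*
    )
    \s+
    (?P<last>
        [A-Z]+                        # first word of the last name
        (?: (?:-|\s+) [A-Z]+ )*       # further words, joined by hyphen or space
    )
    $
    """,
    re.VERBOSE,
)

for name in ("JOHN SMITH", "JOHN JAY EL GHAMRY SABE",
             "CICERO JOSE TORRES DE AMORIM SILVA"):
    m = NAME_RE.match(name)
    print(m.group("first"), "|", m.group("last"))
```

This prints `JOHN | SMITH`, `JOHN JAY | EL GHAMRY SABE`, and `CICERO JOSE TORRES | DE AMORIM SILVA`.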

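Reasons for the poor performance

I just saw that, while fixing the mistake I made, you introduced a huge overhead in the execution of the RegEx. Originally my expression worked as you intended, but it included the space between the last middle name and the last name in the first-name group. To fix that, I modified the expression slightly, which solved the space problem, but I did not rerun all the tests, so I did not notice that the handling of the name prefixes had broken. Honestly, I had simply overlooked the part about last names separated by hyphens, even though that is obviously very easy to fix ;)
Back to my main point: although you use the RegEx features as intended, your version performs a lookahead after every character of the first name, even though it is only needed after each space! And because a lookahead is a fairly expensive operation, this makes the RegEx much slower.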

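How to improve performance?

I fixed the RegEx myself, in a way that keeps the number of lookahead operations low. While I was at it I also improved the last-name part, because I think a hyphen should only appear between letters and not just anywhere, i.e. not as the last character and not between spaces. And a space obviously should not be the last character either.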
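The effect of the stricter last-name part can be checked in isolation. The sketch below (Python; the constant names `STRICT_LAST` and `LOOSE_LAST` are made up for this illustration) compares the sub-pattern `[A-Z]+(?:(?:-|\s+)[A-Z]+)*` with a plain character class such as `[A-Z\-\s]+`.

```python
import re

# Letters, then zero or more groups of (one hyphen OR whitespace) + letters.
STRICT_LAST = re.compile(r"[A-Z]+(?:(?:-|\s+)[A-Z]+)*")
# Any mix of letters, hyphens, and whitespace, in any order.
LOOSE_LAST = re.compile(r"[A-Z\-\s]+")

for candidate in ("SMITH-JONES", "DE LA HOYA", "SMITH-", "SMITH - JONES", "SMITH "):
    strict = STRICT_LAST.fullmatch(candidate) is not None
    loose = LOOSE_LAST.fullmatch(candidate) is not None
    print(f"{candidate!r}: strict={strict} loose={loose}")
```

Both accept `SMITH-JONES` and `DE LA HOYA`. Only the loose class accepts a trailing hyphen, a hyphen surrounded by spaces, or a trailing space.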

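Benchmarks

To show that my RegEx (at least the original and my fixed version) does not perform badly, I ran some benchmarks and unit tests. If you want to run the tests yourself to make sure I didn't cheat, you can download the code here ;) The tests are written in Python, but the results should be similar with other RegEx engines.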

Timing RegEx by Ron Rosenfeld (^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$):
=====================

 * Took 0.159s to run test case "JOHN SMITH" 100000 times.
 * Took 0.198s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.209s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.273s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.274s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.135s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.143s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.130s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.109s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.146s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.223s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 2.001s to run all tests 100000 times.


Timing RegEx by Jim McMullen (^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$):
=====================

 * Took 0.634s to run test case "JOHN SMITH" 100000 times.
 * Took 0.649s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.659s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.793s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.689s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.118s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.126s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.168s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.100s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.123s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.143s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 4.201s to run all tests 100000 times.


Timing RegEx by Cu3PO42 (^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$):
=====================

 * Took 0.157s to run test case "JOHN SMITH" 100000 times.
 * Took 0.176s to run test case "JOHN JAY SMITH" 100000 times.
 * Took 0.178s to run test case "JOHN JAYEL SMITH" 100000 times.
 * Took 0.199s to run test case "JOHN JAY SMITH-JONES" 100000 times.
 * Took 0.229s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
 * Took 0.148s to run test case "JOHN JAY EL AMIN" 100000 times.
 * Took 0.172s to run test case "JOHN JAY DE LA HOYA" 100000 times.
 * Took 0.136s to run test case "JOHN JAY EL" 100000 times.
 * Took 0.112s to run test case "JOHN EL AMIN" 100000 times.
 * Took 0.175s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
 * Took 0.200s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 1.881s to run all tests 100000 times.

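As you can see, in fixing my RegEx you introduced more than 100% (!) overhead (4.201s versus 1.881s in total). My tests show that my RegEx is the fastest in five of the eleven test cases (all those without a last-name prefix) and the fastest overall, while also offering more functionality (regarding the last name), which itself costs processing time.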
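The benchmark script itself is not reproduced in this text. As a rough sketch only, timings of this kind could be gathered with Python's `timeit` module. The three patterns and the test names are copied from the output above; the names `PATTERNS`, `NAMES`, and `time_patterns` and the default iteration count are assumptions for illustration. Absolute numbers will differ by machine and Python version.

```python
import re
import timeit

# Patterns copied from the benchmark output above.
PATTERNS = {
    "Ron Rosenfeld": r"^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$",
    "Jim McMullen": r"^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$",
    "Cu3PO42": r"^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$",
}

# Test cases copied from the benchmark output above.
NAMES = [
    "JOHN SMITH", "JOHN JAY SMITH", "JOHN JAYEL SMITH",
    "JOHN JAY SMITH-JONES", "JOHN JAY JIMMY SMITH JONES",
    "JOHN JAY EL AMIN", "JOHN JAY DE LA HOYA", "JOHN JAY EL",
    "JOHN EL AMIN", "JOHN JAY EL GHAMRY SABE",
    "CICERO JOSE TORRES DE AMORIM SILVA",
]

def time_patterns(number=1000):
    """Return {author: total seconds to match every name `number` times}."""
    results = {}
    for author, pattern in PATTERNS.items():
        compiled = re.compile(pattern)  # compile once, outside the timed call
        results[author] = sum(
            timeit.timeit(lambda n=name: compiled.match(n), number=number)
            for name in NAMES
        )
    return results

for author, seconds in time_patterns().items():
    print(f"{author}: {seconds:.3f}s")
```

Compiling each pattern once before timing keeps compilation cost out of the measurement, so the figures reflect matching only.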

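Answer 1: (score: 1)

I would match everything up to the prefix as the first name (using a negative lookahead), and then match the rest of the line as the last name.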

^(?<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<Last>\b[-A-Z\s]+)$
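A quick way to try this expression is the following Python sketch. The named groups are rewritten as `(?P<name>...)` for Python's `re` module, and the constant name `ANSWER_RE` is made up for this illustration.

```python
import re

# Expression from this answer, with Python-style named groups.
ANSWER_RE = re.compile(
    r"^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$"
)

for name in ("JOHN JAY SMITH", "JOHN JAY DE LA HOYA", "JOHN JAY EL GHAMRY SABE"):
    m = ANSWER_RE.match(name)
    print(m.group("First"), "|", m.group("Last"))
```

This prints `JOHN JAY | SMITH`, `JOHN JAY | DE LA HOYA`, and `JOHN JAY | EL GHAMRY SABE`. The lookahead runs after every character of the first name, and it stops the group from consuming the space that precedes a prefix word.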