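I'm learning regular expressions (C#) using RegexBuddy (love it). I've been trying to parse names that follow very specific patterns. I know this can never be perfect, but I think I'm very close to what I want to accomplish.
Assumptions:
Here is my regex so far (using a few surname prefixes):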
^(?<first>[ A-Z]+?) (?<last>(?<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$
This works well (capturing the first name, the last name, and the last-name prefix) for:
JOHN SMITH
JOHN JAY SMITH
JOHN JAYEL SMITH
JOHN JAY SMITH-JONES
JOHN JAY JIMMY SMITH JONES -- only "JONES" is in the last name, which is okay for this exercise
JOHN JAY EL AMIN
JOHN JAY DE LA HOYA -- "DE LA HOYA" is the last name
JOHN JAY EL -- a case where "EL" is actually the last name
JOHN EL AMIN
But these two, which have a multi-part last name after the surname prefix, fail (only the final word is captured in the last field):
JOHN JAY EL GHAMRY SABE
CICERO JOSE TORRES DE AMORIM SILVA
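To make the failure concrete, here is a minimal sketch that runs the pattern above through Python's `re` module. It assumes the .NET-style `(?<name>...)` groups can simply be rewritten as `(?P<name>...)`; the helper name `split_name` is made up for illustration. Because the `last` group allows prefix words followed by exactly one word (`[A-Z\-]+?` contains no space), the lazy `first` group ends up absorbing everything before that final word.

```python
import re

# The question's pattern, with (?<name>...) rewritten as (?P<name>...)
# so that Python's re module accepts it.
ORIGINAL = re.compile(
    r"^(?P<first>[ A-Z]+?) (?P<last>(?P<pfx>(?:(?:EL|DE|LA) )*)[A-Z\-]+?)$"
)


def split_name(name):
    """Return (first, last, pfx), or None if the pattern does not match."""
    m = ORIGINAL.match(name)
    return m.group("first", "last", "pfx") if m else None


for name in (
    "JOHN JAY DE LA HOYA",                 # prefix + one word: works
    "JOHN JAY EL GHAMRY SABE",             # prefix + two words: fails
    "CICERO JOSE TORRES DE AMORIM SILVA",  # prefix + two words: fails
):
    print(name, "->", split_name(name))
```

For the two failing names, `last` contains only the final word and `pfx` is empty, which is the behaviour described above.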
So ... 2 questions:
Answer 0 (score: 1)
You can use a negative lookahead to assert that the next "middle name" you match is not one of the surname prefixes:
^(?<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$
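Also note that I changed the explicit spaces to \s, which is considered good practice because most RegEx engines can be set to ignore whitespace in the pattern entirely, which lets you format the expression more readably. In my RegEx I stuck with the group syntax your program uses, even though (?P<group_name>) may be more common.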
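As an illustration of both points, here is a sketch of the same expression in Python's `re` module, written in verbose mode so that whitespace and comments inside the pattern are ignored. It assumes the `(?P<name>...)` group syntax; the helper `split_name` and the inline comments are mine, not part of the original expression.

```python
import re

# Same expression as above, laid out with re.VERBOSE. Literal spaces in the
# pattern are ignored in this mode, which is why \s is used for whitespace.
PATTERN = re.compile(r"""
    ^
    (?P<first>
        [A-Z]+                      # first given name
        (?: \s+ (?!(?:EL|DE|LA)\s)  # a further word, unless it is a prefix
            [A-Z]+ )*
    )
    \s+
    (?P<last>
        [A-Z]+
        (?: (?:-|\s+) [A-Z]+ )*     # further parts joined by a hyphen or spaces
    )
    $
""", re.VERBOSE)


def split_name(name):
    """Return (first, last), or None if the pattern does not match."""
    m = PATTERN.match(name)
    return (m.group("first"), m.group("last")) if m else None


for name in (
    "JOHN JAY SMITH-JONES",
    "JOHN JAY EL",
    "JOHN JAY EL GHAMRY SABE",
    "CICERO JOSE TORRES DE AMORIM SILVA",
):
    print(name, "->", split_name(name))
```

The lookahead only runs once per candidate word, right after the whitespace that precedes it, so a trailing "EL" with nothing after it is not treated as a prefix and still ends up as the last name.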
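I just saw that, while fixing the mistake I made, you introduced a huge overhead into the execution of the RegEx. Originally my expression worked as you intended, but it included the whitespace between the last middle name and the last name in the first-name group. To fix that I modified the expression slightly, which solved the whitespace problem, but I didn't rerun all the tests, so I didn't notice that the name association had broken. And to be honest, I simply overlooked the part about last names separated by hyphens, even though that is obviously very easy to fix ;)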
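Back to my main point: although you use the RegEx features as intended, your version performs a lookahead after every character of the first name, even though it is only needed after each whitespace! And because a lookahead is a very expensive operation, this makes the RegEx much slower.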
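I fixed the RegEx myself, in a way that keeps the number of lookahead operations low. While I was at it I also improved the last-name part, because I think a hyphen should only appear between letters and not just anywhere, i.e. not as the last character or between spaces. And a space obviously shouldn't be the last character either.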
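The difference in the last-name part can be checked in isolation. The sketch below compares the stricter sub-pattern from my expression with the permissive character class `[A-Z\-\s]+` used in the Jim McMullen expression from the benchmark; the names `STRICT_LAST`, `LOOSE_LAST` and `accepted_by` are made up for this example.

```python
import re

# Last-name part of the fixed expression: words joined by one hyphen or spaces.
STRICT_LAST = re.compile(r"[A-Z]+(?:(?:-|\s+)[A-Z]+)*")
# Permissive alternative: any mix of letters, hyphens and whitespace.
LOOSE_LAST = re.compile(r"[A-Z\-\s]+")


def accepted_by(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for candidate in ("SMITH-JONES", "EL AMIN", "SMITH-", "SMITH - JONES", "SMITH "):
    print(
        repr(candidate),
        "strict:", accepted_by(STRICT_LAST, candidate),
        "loose:", accepted_by(LOOSE_LAST, candidate),
    )
```

The strict version accepts only the first two candidates, while the permissive class accepts all five.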
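To prove that my RegEx (at least the original and my fixed version) doesn't perform badly, I ran some benchmarks and unit tests. If you want to run the tests yourself to make sure I didn't cheat, you can download the code here ;) The tests are written in Python, but the results should be similar for other RegEx engines.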
Timing RegEx by Ron Rosenfeld (^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$):
=====================
* Took 0.159s to run test case "JOHN SMITH" 100000 times.
* Took 0.198s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.209s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.273s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.274s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.135s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.143s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.130s to run test case "JOHN JAY EL" 100000 times.
* Took 0.109s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.146s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.223s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 2.001s to run all tests 100000 times.
Timing RegEx by Jim McMullen (^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$):
=====================
* Took 0.634s to run test case "JOHN SMITH" 100000 times.
* Took 0.649s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.659s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.793s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.689s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.118s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.126s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.168s to run test case "JOHN JAY EL" 100000 times.
* Took 0.100s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.123s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.143s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 4.201s to run all tests 100000 times.
Timing RegEx by Cu3PO42 (^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$):
=====================
* Took 0.157s to run test case "JOHN SMITH" 100000 times.
* Took 0.176s to run test case "JOHN JAY SMITH" 100000 times.
* Took 0.178s to run test case "JOHN JAYEL SMITH" 100000 times.
* Took 0.199s to run test case "JOHN JAY SMITH-JONES" 100000 times.
* Took 0.229s to run test case "JOHN JAY JIMMY SMITH JONES" 100000 times.
* Took 0.148s to run test case "JOHN JAY EL AMIN" 100000 times.
* Took 0.172s to run test case "JOHN JAY DE LA HOYA" 100000 times.
* Took 0.136s to run test case "JOHN JAY EL" 100000 times.
* Took 0.112s to run test case "JOHN EL AMIN" 100000 times.
* Took 0.175s to run test case "JOHN JAY EL GHAMRY SABE" 100000 times.
* Took 0.200s to run test case "CICERO JOSE TORRES DE AMORIM SILVA" 100000 times.
Took 1.881s to run all tests 100000 times.
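As you can see, while fixing my RegEx you introduced an overhead of more than 100% (!). My tests show that my RegEx is the fastest in five of the eleven test cases (all of those without a surname prefix) and the fastest overall, while also providing more functionality (regarding the last name), which costs processing time as well.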
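The benchmark script itself is not reproduced in this post. For anyone who wants to repeat the measurement, the following is a minimal sketch of a comparable harness built on the standard `timeit` module. It is not the original test code: the function `time_pattern`, the dictionary layout and the run count are assumptions made for illustration, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

# The three expressions from the timing output above.
PATTERNS = {
    "Ron Rosenfeld": r"^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?P<Last>\b[-A-Z\s]+)$",
    "Jim McMullen": r"^(?P<first>(?:(?!(?:EL|DE|LA)\s)[A-Z]+\s?)+)\s+(?P<last>[A-Z\-\s]+)$",
    "Cu3PO42": r"^(?P<first>[A-Z]+(?:\s+(?!(?:EL|DE|LA)\s)[A-Z]+)*)\s+(?P<last>[A-Z]+(?:(?:-|\s+)[A-Z]+)*)$",
}

CASES = [
    "JOHN SMITH",
    "JOHN JAY SMITH",
    "JOHN JAYEL SMITH",
    "JOHN JAY SMITH-JONES",
    "JOHN JAY JIMMY SMITH JONES",
    "JOHN JAY EL AMIN",
    "JOHN JAY DE LA HOYA",
    "JOHN JAY EL",
    "JOHN EL AMIN",
    "JOHN JAY EL GHAMRY SABE",
    "CICERO JOSE TORRES DE AMORIM SILVA",
]


def time_pattern(pattern, cases, runs):
    """Return {case: seconds needed to match that case `runs` times}."""
    compiled = re.compile(pattern)  # compile once, outside the timed loop
    return {
        case: timeit.timeit(lambda: compiled.match(case), number=runs)
        for case in cases
    }


RUNS = 1000  # the figures above used 100000 runs per case
for author, pattern in PATTERNS.items():
    timings = time_pattern(pattern, CASES, RUNS)
    print(f"Timing RegEx by {author}:")
    for case, seconds in timings.items():
        print(f'* Took {seconds:.3f}s to run test case "{case}" {RUNS} times.')
    print(f"Took {sum(timings.values()):.3f}s to run all tests {RUNS} times.")
```

Compiling each pattern once before timing keeps compilation cost out of the measurement, so the numbers reflect matching only.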
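Answer 1 (score: 1)
I would match everything up to the prefix as the first name (using a negative lookahead), and then match the rest of the line as the last name.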
^(?<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)\s+(?<Last>\b[-A-Z\s]+)$
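A quick way to try this expression is the sketch below, which assumes Python's `re` module and therefore rewrites the groups as `(?P<name>...)`; the helper `split_name` is made up for illustration. Note that the prefix list in this expression (DE LA, EL, DE, LE) is not identical to the one in the question (EL, DE, LA).

```python
import re

# The expression above, with (?<name>...) rewritten as (?P<name>...).
PATTERN = re.compile(
    r"^(?P<First>(?:[-A-Z\s](?!\b(?:DE\sLA|EL|DE|LE)\b))+)"
    r"\s+(?P<Last>\b[-A-Z\s]+)$"
)


def split_name(name):
    """Return (First, Last), or None if the pattern does not match."""
    m = PATTERN.match(name)
    return (m.group("First"), m.group("Last")) if m else None


for name in (
    "JOHN SMITH",
    "JOHN JAYEL SMITH",
    "JOHN JAY EL AMIN",
    "JOHN JAY EL GHAMRY SABE",
    "CICERO JOSE TORRES DE AMORIM SILVA",
):
    print(name, "->", split_name(name))
```

Here the lookahead is evaluated after every character consumed by the First group, and the `\b` anchors keep it from firing inside a word such as JAYEL.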