Get all names with a regular expression

Date: 2019-12-10 16:16:48

Tags: php regex

I made a regular expression for the various names that can appear in a string:

$nameRegex = "/[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ']" .
    "[.A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽa-z-àáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšž']" .
    '+\b(?: \b' .
    "[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ']?[van|de]" .
    "[A-Z-ÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽa-z-àáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšž']+\b)*/u";

I'm trying to match all sorts of non-standard cases (input on the left, expected match on the right), such as:

John Doe waves                        | John Doe
Bakary N'Diaye says hello             | Bakary N'Diaye
Iván Aguilar goes well                | Iván Aguilar
Cisteró shot                          | Cisteró
Dan I Soylu shots                     | Dan I Soylu
Mike van der Hoorn with a cross       | Mike van der Hoorn
M.J. Williams takes a shot            | M.J. Williams
Donny van de Beek left foot           | Donny van de Beek
Mike van der Hoorn hello              | Mike van der Hoorn
Artak G. Grigoryan with through ball  | Artak G. Grigoryan
Trent Alexander-Arnold after a break  | Trent Alexander-Arnold

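But my regex does a poor job of matching these names; you can see it in action at https://regexr.com/4qgbt.

How can I improve my regex so that it catches all the names? (The name is always at the beginning of the sentence.)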

1 Answer:

Answer 0: (score: 2)

Maybe an expression similar to

^([\p{L} '.-]+?)(?:\s[a-z]+)*\h*$

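might work with preg_match_all. It consists of two groups. The first group, on the left, is a capturing group that captures the name. The second group, on the right, is a non-capturing group that consumes everything after the name, which we are not interested in.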

RegEx Demo 1

Test 1

// m: ^ and $ match at every line; u: treat pattern and subject as UTF-8,
// so that \p{L} also matches multibyte letters such as á and ó.
$re = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\s*$/mu';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output 1

array(11) {
  [0]=>
  array(2) {
    [0]=>
    string(14) "John Doe waves"
    [1]=>
    string(8) "John Doe"
  }
  [1]=>
  array(2) {
    [0]=>
    string(25) "Bakary N'Diaye says hello"
    [1]=>
    string(14) "Bakary N'Diaye"
  }
  [2]=>
  array(2) {
    [0]=>
    string(23) "Iván Aguilar goes well"
    [1]=>
    string(13) "Iván Aguilar"
  }
  [3]=>
  array(2) {
    [0]=>
    string(13) "Cisteró shot"
    [1]=>
    string(8) "Cisteró"
  }
  [4]=>
  array(2) {
    [0]=>
    string(17) "Dan I Soylu shots"
    [1]=>
    string(11) "Dan I Soylu"
  }
  [5]=>
  array(2) {
    [0]=>
    string(31) "Mike van der Hoorn with a cross"
    [1]=>
    string(18) "Mike van der Hoorn"
  }
  [6]=>
  array(2) {
    [0]=>
    string(26) "M.J. Williams takes a shot"
    [1]=>
    string(13) "M.J. Williams"
  }
  [7]=>
  array(2) {
    [0]=>
    string(27) "Donny van de Beek left foot"
    [1]=>
    string(17) "Donny van de Beek"
  }
  [8]=>
  array(2) {
    [0]=>
    string(24) "Mike van der Hoorn hello"
    [1]=>
    string(18) "Mike van der Hoorn"
  }
  [9]=>
  array(2) {
    [0]=>
    string(36) "Artak G. Grigoryan with through ball"
    [1]=>
    string(18) "Artak G. Grigoryan"
  }
  [10]=>
  array(2) {
    [0]=>
    string(37) "Trent Alexander-Arnold after a break
"
    [1]=>
    string(22) "Trent Alexander-Arnold"
  }
}
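
When only the names are needed, the first capture group can be read directly. The following is a minimal sketch rather than part of the original answer: the helper name extractNames is hypothetical, and the u modifier is included on the assumption that the input is UTF-8, so that accented letters such as á are matched by \p{L}.

```php
<?php
// Hypothetical helper: returns the leading name of every line in $text.
function extractNames(string $text): array
{
    // m: ^ and $ work per line; u: treat pattern and subject as UTF-8.
    $re = '/^([\p{L} \'.-]+?)(?:\s[a-z]+)*\h*$/mu';

    // With the default PREG_PATTERN_ORDER, all group-1 captures
    // end up together in $matches[1].
    if (!preg_match_all($re, $text, $matches)) {
        return [];
    }
    return $matches[1];
}

$names = extractNames("John Doe waves\nIván Aguilar goes well\nCisteró shot");
print_r($names);
```

Leaving out PREG_SET_ORDER is a deliberate choice here: it yields a flat list of names without looping over the per-match arrays.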

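The left side of the input strings seems unproblematic, since every line starts with a name. On the right side, however, each line ends with lowercase words, each preceded by a single whitespace character. We can write an expression that finds those words, for example with a positive lookahead: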

(?=(?:\s[a-z]+)*\h*$)

and then, with a second expression,

^[\p{L} '.-]+?

we collect the names, so the final expression becomes:

^[\p{L} '.-]+?(?=(?:\s[a-z]+)*\h*$)

RegEx Demo 2 with positive lookahead

Test 2

// m: ^ and $ match at every line; u: treat pattern and subject as UTF-8,
// so that \p{L} also matches multibyte letters such as á and ó.
$re = '/^[\p{L} \'.-]+?(?=(?:\s[a-z]+)*\h*$)/mu';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output 2

array(11) {
  [0]=>
  array(1) {
    [0]=>
    string(8) "John Doe"
  }
  [1]=>
  array(1) {
    [0]=>
    string(14) "Bakary N'Diaye"
  }
  [2]=>
  array(1) {
    [0]=>
    string(13) "Iván Aguilar"
  }
  [3]=>
  array(1) {
    [0]=>
    string(8) "Cisteró"
  }
  [4]=>
  array(1) {
    [0]=>
    string(11) "Dan I Soylu"
  }
  [5]=>
  array(1) {
    [0]=>
    string(18) "Mike van der Hoorn"
  }
  [6]=>
  array(1) {
    [0]=>
    string(13) "M.J. Williams"
  }
  [7]=>
  array(1) {
    [0]=>
    string(17) "Donny van de Beek"
  }
  [8]=>
  array(1) {
    [0]=>
    string(18) "Mike van der Hoorn"
  }
  [9]=>
  array(1) {
    [0]=>
    string(18) "Artak G. Grigoryan"
  }
  [10]=>
  array(1) {
    [0]=>
    string(22) "Trent Alexander-Arnold"
  }
}
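
If the sentences arrive one at a time rather than as a single multi-line string, the lookahead expression can be applied per line with preg_match. This is only a sketch under that assumption: the helper name leadingName is hypothetical, and the u modifier is included on the assumption that the input is UTF-8.

```php
<?php
// Hypothetical helper: returns the name at the start of one line, or null.
function leadingName(string $line): ?string
{
    // The lookahead checks for the trailing lowercase words without
    // consuming them, so $m[0] is the name itself and no capture group
    // is needed.
    $re = '/^[\p{L} \'.-]+?(?=(?:\s[a-z]+)*\h*$)/u';

    return preg_match($re, $line, $m) === 1 ? $m[0] : null;
}

var_dump(leadingName('Donny van de Beek left foot'));
var_dump(leadingName('42 is not a name'));
```

Returning null for a line that does not start with a name-like character lets the caller tell "no name found" apart from an empty string.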

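Method 3

I guess we could also look at the preg_replace function, forget about the names altogether, and focus on matching the right-hand boundary of the name in each line, perhaps with a simple expression similar to: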

(?:\s[a-z]+){0,}\h*$

or:

(?:\s*\b[a-z]+){0,}\h*$

RegEx Demo

Test 3

$re = '/(?:\s[a-z]+){0,}\h*$/m';
$str = 'John Doe waves
Bakary N\'Diaye says hello
Iván Aguilar goes well
Cisteró shot
Dan I Soylu shots
Mike van der Hoorn with a cross
M.J. Williams takes a shot
Donny van de Beek left foot
Mike van der Hoorn hello
Artak G. Grigoryan with through ball
Trent Alexander-Arnold after a break ';

echo preg_replace($re, '', $str);

Output 3

John Doe
Bakary N'Diaye
Iván Aguilar
Cisteró
Dan I Soylu
Mike van der Hoorn
M.J. Williams
Donny van de Beek
Mike van der Hoorn
Artak G. Grigoryan
Trent Alexander-Arnold

RegEx Demo 3 for preg_replace
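
The preg_replace approach returns one string, so a further step is needed to get a list of names. The sketch below shows one possible way to do that; the helper name namesByStripping is hypothetical, and the u modifier is an addition that assumes UTF-8 input.

```php
<?php
// Hypothetical helper: strips the trailing lowercase words from each line
// and returns what is left as a list of names.
function namesByStripping(string $text): array
{
    // u is an addition here: it makes PCRE treat the subject as UTF-8
    // instead of as raw bytes.
    $stripped = preg_replace('/(?:\s[a-z]+)*\h*$/mu', '', $text);
    if ($stripped === null) {
        return []; // preg_replace failed, e.g. on invalid UTF-8
    }

    // \R matches any line break; PREG_SPLIT_NO_EMPTY drops blank entries.
    return preg_split('/\R/u', $stripped, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(namesByStripping("Trent Alexander-Arnold after a break \nM.J. Williams takes a shot\n"));
```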

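Method 4:

Perhaps this would be the simplest and fastest method. Here, a greedy expression runs up to the last uppercase letter in a line, and then we append either \S+ or \S* to take the rest of that word: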

^.*\p{Lu}\S+

^.*\p{Lu}\S*

RegEx Demo 4

or with a numeric quantifier:

^.{0,50}\p{Lu}\S*

RegEx Demo 5
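
Method 4 can be tried in PHP along the following lines. This is a minimal sketch, not code from the original answer: the helper name namesByLastCapital is hypothetical, and the m and u modifiers are assumptions (one name per line, UTF-8 input).

```php
<?php
// Hypothetical helper for Method 4: match greedily up to the last
// uppercase letter on each line, then take the rest of that word.
function namesByLastCapital(string $text): array
{
    // m: ^ anchors at every line start; u: \p{Lu} recognizes UTF-8 letters.
    preg_match_all('/^.*\p{Lu}\S*/mu', $text, $matches);

    // $matches[0] holds the full match for each line that has a capital.
    return $matches[0] ?? [];
}

print_r(namesByLastCapital("Dan I Soylu shots\nArtak G. Grigoryan with through ball"));
```

Because the match extends to the last uppercase letter, a capitalized word later in the sentence (for example, a second player's name) would be pulled into the result, so this variant only fits input where the name is the only capitalized part of the line.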


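If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. If you are interested, you can watch the matching steps or modify them in the debugger link. The debugger demonstrates how a RegEx engine might consume some sample input strings step by step and perform the matching process.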