PHP preg_match_all capturing

Time: 2013-01-13 18:37:48

Tags: php regex preg-match-all

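I want to use preg_match_all in PHP to capture each of the following in its own group:

  1. Chapter, Section or Page
  2. The number (and letter, if any) of the specified Chapter, Section or Page. If they are separated by only a single space, that should still be taken into account
  3. The word "and" or "or"

Keep in mind that I want to ignore all book titles, that the number of items in the string may vary, and that the regex should work for all of the following examples:

  1. Ch1 and Sect2b
  2. Ch 4 x unwantedtitle and Sect 5y unwanted title and Sect6 z and Ch7 or Ch8

So far, this is what I have managed to come up with: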

          $str = 'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3';
          preg_match_all ('/([a-z]+)(?=\d|\d\s)\s*(\d*)\s*(?<=\d|\d\s)([a-z]?).*?(and|or)?/i', $str, $matches);
      
          Array
          (
              [0] => Array
                  (
                      [0] => Pg3
                  )
      
              [1] => Array
                  (
                      [0] => Pg
                  )
      
              [2] => Array
                  (
                      [0] => 3
                  )
      
              [3] => Array
                  (
                      [0] => 
                  )
      
              [4] => Array
                  (
                      [0] => 
                  )
      
          )
      

The expected result should be:

          Array
          (
              [0] => Array
                  (
                      [0] => Ch 1 a and 
                      [1] => Sect 2b and 
                      [2] => Pg3
                  )
      
              [1] => Array
                  (
                      [0] => Ch
                      [1] => Sect
                      [2] => Pg
                  )
      
              [2] => Array
                  (
                      [0] => 1
                      [1] => 2
                      [2] => 3
                  )
      
              [3] => Array
                  (
                      [0] => a
                      [1] => b
                      [2] => 
                  )
      
              [4] => Array
                  (
                      [0] => and
                      [1] => and
                      [2] => 
                  )
      
          )
      

2 answers:

Answer 0 (score: 0)

This is the closest I could get:

$str = 'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3';
preg_match_all ('/((Ch|Sect|Pg)\s?(\d+)\s?(\w?))(.*?(and|or))?/i', $str, $matches);


Array
(
    [0] => Array
        (
            [0] => Ch 1 a unwantedtitle and
            [1] => Sect 2b unwanted title and
            [2] => Pg3
        )

    [1] => Array
        (
            [0] => Ch 1 a
            [1] => Sect 2b
            [2] => Pg3
        )

    [2] => Array
        (
            [0] => Ch
            [1] => Sect
            [2] => Pg
        )

    [3] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )

    [4] => Array
        (
            [0] => a
            [1] => b
            [2] => 
        )

    [5] => Array
        (
            [0] =>  unwantedtitle and
            [1] =>  unwanted title and
            [2] => 
        )

    [6] => Array
        (
            [0] => and
            [1] => and
            [2] => 
        )

)
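
The extra groups 1 and 5 in the result above come from the two wrapping parentheses. As a minimal sketch (not the answerer's code), the variant below makes those wrappers non-capturing so that only the four requested groups remain. It also makes two changes that are assumptions on my part: the letter must be a single letter followed by a word boundary, because `\s?(\w?)` appears to pick up the "a" of "and" in an input such as `Ch1 and Sect2b`, and the conjunction must be a whole word. The helper name `capture_refs` is hypothetical.

```php
<?php
// Sketch only: capture_refs() is a hypothetical helper, not part of the answer.
function capture_refs(string $str): array
{
    $pattern = '/\b(Ch|Sect|Pg)\s?(\d+)'  // 1: keyword, 2: number
             . '(?:\s?([a-z])\b)?'        // 3: optional single letter, e.g. "2b" or "1 a"
             . '(?:.*?\b(and|or)\b)?/i';  // 4: next whole-word and/or, skipping any title
    preg_match_all($pattern, $str, $m);

    // Return the four groups under readable keys; $m[0] (full matches) is dropped.
    return array(
        'word'   => $m[1],
        'number' => $m[2],
        'letter' => $m[3],
        'cond'   => $m[4],
    );
}

print_r(capture_refs('Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3'));
```

One limitation of this sketch: it assumes every pair of references is separated by "and" or "or". If two references follow each other without a conjunction, the lazy `.*?` would run past the second reference while looking for the next conjunction.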

Answer 1 (score: 0)

This is how I would do it.

$arr = array(
    'Ch1 and Sect2b',
    'Ch 1 a unwantedtitle and Sect 2b unwanted title and Pg3',
    'Ch 4 x unwantedtitle and Sect 5y unwanted title and' .
        ' Sect6 z and Ch7 or Ch8a',
    'Assume this is ch1a and ch 2 or ch seCt 5c.' .
        ' Then SECT or chA pg22a and pg 13 andor'
);

foreach ($arr as $a) {
    var_dump($a);
    preg_match_all(
    '~
        \b(?P<word>ch|sect|(pg))
        \s*(?P<number>\d+)
        (?(2)\b|
            \s*
            (?P<letter>(?!(?<=\s)(?:and|or)\b)[a-z]+)?
            \s*
            (?:(?<=\s)(?P<cond>and|or)\b)?
        )
    ~xi'
    ,$a,$m);
    foreach ($m as $k => $v) {
        if (is_numeric($k) && $k !== 0) unset($m[$k]);
        // this is for 'beautifying' the result array
        // note that $m[0] will still return whole matches
    }
    print_r($m);
}

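I had to make pg a capturing group because I needed to write an explicit condition for it: it can be followed by a number (with or without a space in between), but it cannot be followed by a letter, on the assumption that a page indicator never carries a letter as in "pg23a".

That is why I chose to name each group and to "beautify" the result with the inner foreach loop in the code. Otherwise, if you use numeric indices instead of named ones, you need to skip $m[2] every time.

To show an example, here is the output for the last item in $arr.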
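
The same clean-up can probably be written without an explicit loop. The sketch below uses `array_filter` with `ARRAY_FILTER_USE_KEY` (available from PHP 5.6) to keep the full matches at key 0 plus every named group. The helper name `keep_named_groups` and the shortened pattern are hypothetical and only serve the illustration.

```php
<?php
// Sketch only: keep_named_groups() is a hypothetical helper.
function keep_named_groups(array $matches): array
{
    return array_filter(
        $matches,
        function ($key) {
            // Keep the full matches (key 0) and all named (string) keys;
            // drop the numeric duplicates of the named groups.
            return $key === 0 || is_string($key);
        },
        ARRAY_FILTER_USE_KEY
    );
}

// Shortened pattern for illustration, not the full pattern from this answer.
preg_match_all('~(?P<word>ch|sect|pg)\s*(?P<number>\d+)~i', 'Ch1 and Sect2b', $m);
$named = keep_named_groups($m);
print_r(array_keys($named)); // keys left: 0, word, number
```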

Array
(
    [0] => Array
        (
            [0] => ch1a and
            [1] => ch 2 or
            [2] => seCt 5c
            [3] => pg 13
        )

    [word] => Array
        (
            [0] => ch
            [1] => ch
            [2] => seCt
            [3] => pg
        )

    [number] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 5
            [3] => 13
        )

    [letter] => Array
        (
            [0] => a
            [1] => 
            [2] => c
            [3] => 
        )

    [cond] => Array
        (
            [0] => and
            [1] => or
            [2] => 
            [3] => 
        )

)
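
If one row per reference is easier to work with than one column per group, the `PREG_SET_ORDER` flag may help. The sketch below reuses the pattern from this answer unchanged; the flag, the helper name `refs_as_rows` and the row layout are my additions. With `PREG_SET_ORDER`, trailing groups that did not take part in a match are left out of that match's array, so the optional groups are read with `??` (PHP 7 or later).

```php
<?php
// Pattern copied from the answer above; everything else is illustrative.
$pattern = '~
    \b(?P<word>ch|sect|(pg))
    \s*(?P<number>\d+)
    (?(2)\b|
        \s*
        (?P<letter>(?!(?<=\s)(?:and|or)\b)[a-z]+)?
        \s*
        (?:(?<=\s)(?P<cond>and|or)\b)?
    )
~xi';

// Sketch only: refs_as_rows() is a hypothetical helper.
function refs_as_rows(string $pattern, string $str): array
{
    // Each element of $sets holds all groups of one match.
    preg_match_all($pattern, $str, $sets, PREG_SET_ORDER);

    $rows = array();
    foreach ($sets as $set) {
        $rows[] = array(
            'word'   => $set['word'],
            'number' => $set['number'],
            // Optional groups may be absent when they did not match.
            'letter' => $set['letter'] ?? '',
            'cond'   => $set['cond'] ?? '',
        );
    }
    return $rows;
}

print_r(refs_as_rows($pattern, 'Ch1 and Sect2b'));
```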