Regex match only if not inside a tag

Time: 2016-06-05 21:35:56

Tags: php regex

I am trying to match '<TAG2>' as long as it is not inside <TAG>.

For example:

This is a WORD --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match

I am using PHP, so I can't use a variable-length negative lookbehind.

I tried the regex from "Match text not inside span tags", but it does not work in my case when there are multiple tags.

<TAG><TAG2>xxx</TAG2></TAG>
<TAG><TAG2>xxx</TAG2></TAG>  - This will match from the first <TAG2> to  the end of the second </TAG2>.  I'm assuming this is because my regex includes <TAG2>[\s\S]*</TAG2>
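The suspicion about `[\s\S]*` can be checked in isolation. The sketch below is only an illustration: it tests the quoted fragment on its own (the full pattern from the question is not shown), and the input string and variable names are made up, with `yyy` in the second tag so the two elements can be told apart.

```php
<?php
// Hypothetical input: two <TAG2> elements on separate lines.
$text = "<TAG><TAG2>xxx</TAG2></TAG>\n<TAG><TAG2>yyy</TAG2></TAG>";

// Greedy quantifier: [\s\S]* consumes as much as it can, so the match
// runs from the first <TAG2> to the last </TAG2> in the input.
preg_match('~<TAG2>[\s\S]*</TAG2>~', $text, $greedy);

// Lazy quantifier: [\s\S]*? stops at the first </TAG2> it can reach.
preg_match('~<TAG2>[\s\S]*?</TAG2>~', $text, $lazy);

echo $greedy[0], "\n---\n", $lazy[0], "\n";
```

The greedy version spans both lines, which is the behaviour described above; the lazy version stays within the first element.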

1 answer:

Answer 0: (score: 1)

Foreword

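I'd recommend using a parsing engine, but it sounds like you have creative control over the complexity of the HTML. So as long as you don't have complicated nesting or other strange edge cases, this should work.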

Description

(<tag2>.*?</tag2>)|<tag>(?:(?!<tag\s?>).)*

This regular expression will do the following:

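  • Populate capture group 1 with <tag2>...</tag2>, provided the tag is not already enclosed in <tag>...</tag>, as in <tag>.<tag2>..</tag2>.</tag>
  • It also matches the text starting at every <tag> (running to the next <tag> or the end of the line), but for those matches capture group 1 has no value.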
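The two points above suggest a simple filter in PHP: run the pattern and keep only the matches where capture group 1 is populated. This is a minimal sketch, not a definitive implementation. The `~` delimiter, the `i` flag (assumed so that the upper-case tags in the sample text match) and the variable names are my own choices.

```php
<?php
// Pattern from the answer. The i flag is an assumption, added so the
// lower-case pattern also matches upper-case tags.
$pattern = '~(<tag2>.*?</tag2>)|<tag>(?:(?!<tag\s?>).)*~i';

$text = <<<'TXT'
This <tag2>is a WORD</tag2> --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match
TXT;

// PREG_SET_ORDER yields one array per match: [0] is the full match,
// [1] is capture group 1 when it took part in the match.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

// Keep only the matches where capture group 1 was populated, i.e. the
// <tag2> elements that were not swallowed by the <tag> alternative.
$standalone = [];
foreach ($matches as $m) {
    if (isset($m[1]) && $m[1] !== '') {
        $standalone[] = $m[1];
    }
}

print_r($standalone);
```

The `<tag>` alternative exists only to consume enclosed text so that the first alternative never gets a chance to match inside it; its matches are simply discarded.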

Example

Live demo

https://regex101.com/r/uQ7xR5/1

Sample text

This <tag2>is a WORD</tag2> --- Match
<TAG><TAG2>xxx</TAG2></TAG> --- Not a match
<TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match

Sample matches

Note that capture group 1 is populated only by a <tag2>...</tag2> that is not enclosed in <tag>..</tag>.

[0][0] = <tag2>is a WORD</tag2>
[0][1] = <tag2>is a WORD</tag2>

[1][0] = <TAG><TAG2>xxx</TAG2></TAG> --- Not a match
[1][1] = 

[2][0] = <TAG>xxxxxxx<TAG2>yyyy</TAG2>xxxxxxx</TAG>  --- Not a match
[2][1] = 

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    <tag2>                   '<tag2>'
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
    </tag2>                  '</tag2>'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  <tag>                    '<tag>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <tag                     '<tag'
----------------------------------------------------------------------
      \s?                      whitespace (\n, \r, \t, \f, and " ")
                               (optional (matching the most amount
                               possible))
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character except \n
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
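
The breakdown above can also be carried inside the pattern itself using PCRE's `x` (extended) flag, which ignores unescaped whitespace and allows `#` comments. The sketch below is an illustration under assumptions: the `i` flag, the `~` delimiter, the variable names and the short test string are not part of the answer.

```php
<?php
// Same pattern as in the answer, written with the x (extended) flag so
// each node can carry a comment. The i flag is an assumption.
$annotated = '~
    ( <tag2> .*? </tag2> )    # group 1: a whole <tag2>...</tag2> element, lazily
  |                           # OR
    <tag>                     # a literal opening <tag>
    (?:                       # then, repeated as often as possible:
        (?! <tag \s? > )      #   as long as another <tag> does not start here
        .                     #   consume one character (not a newline)
    )*
~xi';

// Compact form, for comparison.
$compact = '~(<tag2>.*?</tag2>)|<tag>(?:(?!<tag\s?>).)*~i';

// Hypothetical test string based on the sample text.
$text = "This <tag2>is a WORD</tag2>\n<TAG><TAG2>xxx</TAG2></TAG>";

preg_match_all($annotated, $text, $a);
preg_match_all($compact, $text, $b);

// Both forms should produce identical match arrays.
var_dump($a === $b);
```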