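Remove the complete HTML tag when a marker string is found

Time: 2016-02-06 16:43:17

Tags: php regex

A string contains HTML tags whose text is a word followed by a suffix (in this example, ...rem).

Example: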

<b>SomeText...rem</b>
<u>SomeText...rem</u>
<strong>SomeText...rem</strong>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

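When the text inside an HTML tag contains

...rem

the complete HTML element, tags and text together, should be removed.

I can rename "...rem" to something else. It is only a marker.

Is this possible?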

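2 Answers:

Answer 0: (score: 1)

I strongly recommend using an HTML parser. However, since your question asks for a regex, you can use the following pattern and decide in a callback what to replace each match with.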

/(?s)<(\w+)[^>]*>(.*?)<\/\1>/

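Explanation

  • (?s) - the s flag, so that the . character also matches newlines.
  • <(\w+)[^>]*> - matches an opening HTML tag and captures the element name.
  • (.*?) - a second capture group that matches the content of the element.
  • <\/\1> - matches the closing HTML tag, using a backreference to the first capture group (the tag name).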
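To make the capture groups concrete, here is a small sketch that runs the pattern against one of the question's sample elements and prints what each group holds. The variable names are only illustrative.

```php
<?php
// Same pattern as above; $sample is one of the elements from the question.
$pattern = '/(?s)<(\w+)[^>]*>(.*?)<\/\1>/';
$sample  = '<a href="/">SomeText...rem</a>';

if (preg_match($pattern, $sample, $m)) {
    echo $m[0], "\n"; // the whole element, opening tag to closing tag
    echo $m[1], "\n"; // group 1: the tag name
    echo $m[2], "\n"; // group 2: the content between the tags
}
```

Group 2 is the part the callback inspects for the marker.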

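Use the preg_replace_callback function: if the second capture group contains the substring ...rem, replace the match with an empty string. Otherwise, leave it unchanged by replacing the match with itself.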


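// $string holds the input HTML; the cleaned text is returned, so keep the result.
$result = preg_replace_callback('/(?s)<(\w+)[^>]*>(.*?)<\/\1>/', function ($m) {
  // Drop the whole element when its content holds the marker, else keep it as-is.
  return strpos($m[2], '...rem') !== false ? '' : $m[0];
}, $string);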
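For comparison, the parser route recommended at the top of this answer could look roughly like the sketch below. It is not part of the original answer: the helper name removeMarkedElements() is hypothetical, it assumes PHP's DOM extension is available, that the marker contains no double quote, and it ignores character-encoding concerns. It removes every element that has a direct text child containing the marker.

```php
<?php
// Hypothetical helper; name and signature are assumptions for illustration.
function removeMarkedElements(string $html, string $marker = '...rem'): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so there is a single known container.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $container = $xpath->query('/html/body/div')->item(0);

    // Descendant elements with a direct text child that contains the marker.
    $query = './/*[text()[contains(., "' . $marker . '")]]';
    foreach (iterator_to_array($xpath->query($query, $container)) as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize only the container's children, not the wrapper itself.
    $out = '';
    foreach ($container->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo removeMarkedElements('<p>keep</p><b>SomeText...rem</b>');
```

Because the parser builds a tree, nested elements are handled without backreferences, at the cost of the parser normalizing malformed markup such as mismatched closing tags.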

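Answer 1: (score: 0)

Thought I'd take a shot at this.
Using PHP, here is an exact way to do it.

Updated version

This uses the \K construct, so there is no need to write the
tracker data back into the string. Just replace with nothing. This should also be faster.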
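As a minimal illustration of what \K buys here, the sketch below compares the two styles on a toy string (the strings and variable names are made up for this example). \K resets the start of the reported match, so text matched before it is left untouched by the replacement.

```php
<?php
// Capture-and-write-back: the prefix is consumed, then re-inserted via $1.
$withWriteBack = preg_replace('~(keep )drop~', '$1', 'keep drop');

// \K: the prefix is matched but excluded from the reported match,
// so replacing with an empty string leaves it in place.
$withK = preg_replace('~keep \Kdrop~', '', 'keep drop');

var_dump($withWriteBack === $withK);
```

Both calls produce the same string; the second simply does not need a capture group or a replacement template.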

Formatted and tested:

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*\K(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\2\s*)>)))|.*?(?:(?&RawContent)|(?&Comment))\K)(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\2)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: nothing

 # Dot-all modifier
 (?s)

 # Single group, two alternatives.

 (?:
      # Alternative 1 (highest priority)
      # =================================

      # This is the backtracker. This is crucial!
      # We go all the way up until we find
      # the raw content we are looking for,
      # or comments (because they could hide tags).
      # Then we backtrack from there to 
      # find the closest inner open/close tags
      # that contain our content.

      # Tracker1 - formerly a capture group that was written back in the replacement
      (?:
           (?&Comment)? 
           (?!
                (?&RawContent) 
             |  (?&Comment) 
           )
           . 
      )*

      # Avoid having to write Tracker1 back
      \K 

      # Conditional Assertion -
      # Have we reached the end of string without 
      # finding the tagged Content ?

      (?(?= \z )
           # ---------------------------------------------
           # Yes -  Don't do anything, the remainder is in
           # Tracker1 and is thrown away.
           # ---------------------------------------------

        |  
           # ---------------------------------------------
           # No - Find the tagged Content.
           # If no match, Tracker1 will backtrack 1 char and retry.
           # Here, Tracker1 will find up to the point
           # of the tagged Content and be consumed, but thrown away.
           # ---------------------------------------------

           # Get Target Open tag
           (?<OpenTag>                         # (1)
                (?>
                     <
                     (?:
                          (?<TagName> [\w:]+ )                # (2), tag name
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
                (?<! /> )
           )

           # Get Body containing the raw content   
           (?<Body>                            # (3)

                # Stuff before raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?

                # The raw content we need
                (?= . )
                (?&RawContent)                       

                # Stuff after raw content
                (?&Char_Not_Tag)*? 
                (?:
                     (?&Tag_Not_TargetOpen) 
                     (?&Char_Not_Tag)*? 
                )*?
           )

           # Get Target Close tag
           (?<CloseTag>                        # (4)
                (?>
                     <
                     (?: / \2 \s* )
                     >
                )
           )
      )
   |  
      # Alternative 2 (lowest priority)
      # =================================

      # Here, we've already backtracked all
      # possibilities from Tracker1.
      # At this point, we have raw content, 
      # or comments that we must get past.
      # Comments because they could hide tags.
      # Just take it off, it will be thrown away.

      # Tracker2 - formerly a capture group that was written back in the replacement
      .*? 
      (?:
           (?&RawContent) 
        |  (?&Comment) 
      )

      # Avoid having to write Tracker2 back
      \K 
 )



 # Functions
 # -----------------------
 (?(DEFINE)

      (?<RawContent>                      # (5)

           # Raw content we are looking for.
           # Note - this is content and is not contained
           # in tags nor comments.

           \.\.\.rem                           # '...rem' or whatever
      )

      (?<Tag_Not_TargetOpen>              # (6)

           # Consume any tag that
           # is not the target Open tag.
           # Consume comments as well.
           (?>
                <
                (?:
                     (?! \2 )
                     [\w:]+ 
                     (?: " .*? " | ' .*? ' | [^>]*? )+
                )
                >
             |  
                (?&Comment) 
           )
      )

      (?<Char_Not_Tag>                    # (7)

           # Consume any character
           # that does not begin a tag or comment
           (?!
                (?>
                     <
                     (?:
                          [\w:]+ 
                          (?: " .*? " | ' .*? ' | [^>]*? )+
                     )
                     >
                )
             |  
                (?&Comment) 
           )
           .  
      )

      (?<Comment>                         # (8)

           # Comment
           (?>
                <
                (?:
                     !
                     (?:
                          (?: DOCTYPE .*? )
                       |  (?: \[CDATA\[ .*? \]\] )
                       |  (?: -- .*? -- )
                       |  (?: ATTLIST .*? )
                       |  (?: ENTITY .*? )
                       |  (?: ELEMENT .*? )
                     )
                )
                >
           )
      )
 )
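The "Functions" section relies on PCRE's (?(DEFINE)...) group together with (?&Name) subroutine calls. The stripped-down sketch below shows only that mechanism in PHP; the group names Word and Marker and the pattern itself are made up for illustration and are far simpler than the regex above.

```php
<?php
// Simplified illustration of DEFINE plus named subroutine calls (x flag
// allows whitespace and # comments inside the pattern).
$pattern = '~
    (?&Word) (?&Marker)        # call the named subpatterns defined below
    (?(DEFINE)
        (?<Marker> \.\.\.rem ) # never matched in place, only when called
        (?<Word>   \w+ )
    )
~x';

preg_match($pattern, '<b>SomeText...rem</b>', $m);
echo $m[0], "\n";
```

Groups inside DEFINE never match at the point where they are written; they act like reusable functions, which is how the large regex reuses Comment, RawContent and the other building blocks in several places.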

Test case

Input:

<div>blah blah <i>some text</i> ...rem</div>
<b>SomeText...rem</b>
<u>SomeText...rem</b>
<strong>SomeText...rem</b>
<a href="/">SomeText...rem</a>
<div>SomeText...rem</div>

Output:

 **  Grp 0                      -  ( pos 0 , len 44 ) 
<div>blah blah <i>some text</i> ...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 0 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 1 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 5 , len 33 ) 
blah blah <i>some text</i> ...rem  
 **  Grp 4 [CloseTag]           -  ( pos 38 , len 6 ) 
</div>  

---------------------

 **  Grp 0                      -  ( pos 46 , len 21 ) 
<b>SomeText...rem</b>  
 **  Grp 1 [OpenTag]            -  ( pos 46 , len 3 ) 
<b>  
 **  Grp 2 [TagName]            -  ( pos 47 , len 1 ) 
b  
 **  Grp 3 [Body]               -  ( pos 49 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 63 , len 4 ) 
</b>  

---------------------

 **  Grp 0                      -  ( pos 86 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 70 , len 1 ) 
u  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 114 , len 0 )  EMPTY 
 **  Grp 1 [OpenTag]            -  NULL 
 **  Grp 2 [TagName]            -  ( pos 93 , len 6 ) 
strong  
 **  Grp 3 [Body]               -  NULL 
 **  Grp 4 [CloseTag]           -  NULL 

---------------------

 **  Grp 0                      -  ( pos 120 , len 30 ) 
<a href="/">SomeText...rem</a>  
 **  Grp 1 [OpenTag]            -  ( pos 120 , len 12 ) 
<a href="/">  
 **  Grp 2 [TagName]            -  ( pos 121 , len 1 ) 
a  
 **  Grp 3 [Body]               -  ( pos 132 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 146 , len 4 ) 
</a>  

---------------------

 **  Grp 0                      -  ( pos 152 , len 25 ) 
<div>SomeText...rem</div>  
 **  Grp 1 [OpenTag]            -  ( pos 152 , len 5 ) 
<div>  
 **  Grp 2 [TagName]            -  ( pos 153 , len 3 ) 
div  
 **  Grp 3 [Body]               -  ( pos 157 , len 14 ) 
SomeText...rem  
 **  Grp 4 [CloseTag]           -  ( pos 171 , len 6 ) 
</div>  

Previous version, which writes the Tracker groups back.

 # ** Usage **
 # -----------------
 # Find: '~(?s)(?:(?<Tracker1>(?:(?&Comment)?(?!(?&RawContent)|(?&Comment)).)*)(?(?=\z)|(?<OpenTag>(?><(?:(?<TagName>[\w:]+)(?:".*?"|\'.*?\'|[^>]*?)+)>)(?<!/>))(?<Body>(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?(?=.)(?&RawContent)(?&Char_Not_Tag)*?(?:(?&Tag_Not_TargetOpen)(?&Char_Not_Tag)*?)*?)(?<CloseTag>(?><(?:/\3\s*)>)))|(?<Tracker2>.*?(?:(?&RawContent)|(?&Comment))))(?(DEFINE)(?<RawContent>\.\.\.rem)(?<Tag_Not_TargetOpen>(?><(?:(?!\3)[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>|(?&Comment)))(?<Char_Not_Tag>(?!(?><(?:[\w:]+(?:".*?"|\'.*?\'|[^>]*?)+)>)|(?&Comment)).)(?<Comment>(?><(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?)))>)))~'
 # Replace: '$1$6'