RegEx to remove all markup between <a and </a> tags except for within [ and ]

Time: 2015-06-30 19:25:58

Tags: html regex html-parsing

Trying to figure out a Regular Expression gives me a brain cramp :)

I'm replacing thousands of individual href links with an individual shortcode in WordPress post content, using a plugin that lets me run regular expressions on the content.

Rather than try to combine an SQL query with a RegEx, I'm doing it in two stages: first, SQL to find and replace each individual URL with its individual shortcode; second, remove the rest of the href link markup.

These are some examples of what I have now from the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.

<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>

Now, I need to delete all the href link markup - span, img, etc - in between the leading <a and ending </a>, leaving just the shortcode [nggallery id=xxx].

I've got a start here: https://www.regex101.com/r/rL8wP1/2

But I don't know how to prevent the [nggallery id=xxx] shortcode from being captured in the RegEx.

Update 7/09/2015

@nhahtdh's answer appears to work perfectly, is not too greedy, and doesn't eat adjacent html links. Use ( and ) as delimiters and $1 as a replacement with a regex plugin in WordPress. (If using BBEdit, you will need to use \1)

( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )

Update 7/02/2015

Thanks to Fab Sa (answer below), his regex at https://www.regex101.com/r/rL8wP1/4

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

works in the regex101 emulator, but when used in the BBEdit text editor or the WordPress plugin that runs regex, his regex deletes the [nggallery id=***] shortcode. So is it too greedy? Some other issue?

Update 7/01/2015:

I know, I know, re: "RegEx match open tags except XHTML self-contained tags": YOU CANNOT PARSE HTML WITH REGEX

9 answers:

Answer 0 (score: 7)

You can use this regex:

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

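Use it with the global flag (g). This regex matches the whole link and captures the [nggallery ...] part. Replace every match with $1 to keep only the captured [nggallery ...] part.

I have updated your regex online: https://www.regex101.com/r/rL8wP1/4

PS: With this solution, [nggallery ...] does not need to be inside a specific attribute such as href. If you want to enforce that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>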
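As a rough illustration of the global replace with $1, here is a minimal JavaScript sketch. The function name keepShortcode and the sample string are hypothetical, and the sketch assumes each link sits on its own line, because . does not match line breaks in JavaScript unless the s flag is set.

```javascript
// Hypothetical helper: replaces each matching link with its captured shortcode.
// The pattern is the one from this answer; the g flag makes replace() global.
function keepShortcode(html) {
  const linkPattern = /<a.*(\[nggallery[^\]+]*\]).*?<\/a>/g;
  return html.replace(linkPattern, "$1");
}

// Two of the single-line links from the question, one per line.
const sample = [
  '<a href="[nggallery id=xxxxx]">Click here!</a>',
  '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>',
].join("\n");

console.log(keepShortcode(sample));
// [nggallery id=xxxxx]
// [nggallery id=xxx]
```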

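Answer 1 (score: 7)

Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> swallows everything when there are multiple <a> tags on a single line, because the leading .* is unrestricted and will match across different <a> tags.

By restricting the allowed characters, you can get somewhat closer to matching what you want: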

<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>
  ^^^^^^^

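I require at least one whitespace character after a to make sure it doesn't match other tags, plus a few extra restrictions.

In any case, if you find that it doesn't work in some corner case, you are on your own. Manipulating HTML with regex is generally a bad idea.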
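To make the difference concrete, the following JavaScript sketch runs both patterns over one line that contains an ordinary link followed by two shortcode links. The input string and variable names are hypothetical; the two patterns are copied unchanged from the answers.

```javascript
// Both patterns are copied from the answers; all names below are hypothetical.
const greedyPattern = /<a.*(\[nggallery[^\]+]*\]).*?<\/a>/g;
const restrictedPattern = /<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>/g;

// Hypothetical one-line input: an ordinary link followed by two shortcode links.
const oneLine =
  '<a href="http://example.com/">Home</a> ' +
  '<a href="[nggallery id=1]">One</a> ' +
  '<a href="[nggallery id=2]">Two</a>';

const greedyResult = oneLine.replace(greedyPattern, "$1");
const restrictedResult = oneLine.replace(restrictedPattern, "$1");

// The unrestricted .* runs from the first <a to the last shortcode on the line.
console.log(greedyResult);
// [nggallery id=2]

// [^>]* cannot leave the opening tag, so each link is handled separately.
console.log(restrictedResult);
// <a href="http://example.com/">Home</a> [nggallery id=1] [nggallery id=2]
```

In JavaScript neither pattern crosses line breaks with . unless the s flag is added, so a link whose content spans several lines, like the first example in the question, would also need that flag.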

Answer 2 (score: 5)

Yes, you can't parse HTML with regex, so how about making this bulletproof with a minimalist lexer-parser? It gives you more flexible control over the code.

<?php

$src = <<<EOF
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
EOF;

// we "eat up" the source string by opening <a> tags, closing <a> tags or text
$tokens = array();
while ($src){
    // check if $src begins with this pattern <a (any optional prop)[nggallery (any string)] (any optional prop)>
    if (preg_match('/^<a [^>]*(\[nggallery [^\]]+\])[^>]*>/s', $src, $match)){
        // here you can handle data with more flexibility
        // you can grab the id or the [placeholder] via 
        //$match[1] = [nggallery id=xyz]

        // we store the chunk of string and label it as an opening tag
        $tokens[] = array('type' => 'OPENING_A', 'value' => $match[0]);
    }else if (preg_match('/^<\/a>/s', $src, $match)){
        // we store the chunk of string and label it as a closing tag
        $tokens[] = array('type' => 'CLOSING_A', 'value' => $match[0]);
    }else if (preg_match('/^./s', $src, $match)){
        // we store the chunk of string, in this case a character and label it as text
        $tokens[] = array('type' => 'TEXT', 'value' => $match[0]);
    }
    // finally we remove the identified pattern from the source string
    // and continue "eating it up"
    $src = substr($src, strlen($match[0]));
}

// once the source string has been consumed, we get this array
// var_dump($tokens);
// array (size=247)
//   0 => 
//     array (size=2)
//       'type' => string 'OPENING_A' (length=9)
//       'value' => string '<a href="[nggallery id=xx]">' (length=28)
//   1 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string '<' (length=1)
//   2 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 's' (length=1)
//   3 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 'p' (length=1)
//       ... omitted for brevity


// now with all the parsed data, we can rebuild the html
// as needed
$html = '';
// we keep a flag to know if we are inside a tag
// marked with nggallery
$insideNGGalleryTag = false;

foreach ($tokens as $token){
    if ($token['type'] == 'OPENING_A'){
        $insideNGGalleryTag = true;
        $html .= $token['value'];
    }else if ($token['type'] == 'CLOSING_A'){
        $insideNGGalleryTag = false;
        $html .= $token['value'];
    }else{
        // if we are inside a nggallery tag, we will ignore
        // all text inside it. here you could also remove
        // html properties from the tag, move the [nggallery placeholder]
        // inside the <a> or some other behavior you might need
        if (!$insideNGGalleryTag){
            $html .= $token['value'];
        }
    }
}

// finally echo or write to file the
// modified html, in this case it would return
var_dump($html);
// <a href="[nggallery id=xx]"></a>
// <a href="[nggallery id=xxxxx]"></a>
// <a title="title title" href="[nggallery id=xxx]" target="_blank"></a>
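The PHP above keeps an empty <a ...></a> pair around each shortcode. The comments in the rebuild loop mention that other behavior is possible, so here is a variation that emits only the shortcode, which is what the question asks for. It is sketched in JavaScript rather than PHP, the function name stripGalleryLinks is hypothetical, and it assumes anchor tags are not nested.

```javascript
// Hypothetical helper: scans the string from left to right, as the PHP
// tokenizer does, but writes the output directly instead of storing tokens.
function stripGalleryLinks(src) {
  // Opening <a ...> tag that contains an [nggallery ...] shortcode (group 1).
  const openTag = /^<a\s[^>]*(\[nggallery [^\]]+\])[^>]*>/;
  const closeTag = /^<\/a>/;

  let out = "";
  let inside = false; // true while we are between a matched <a ...> and its </a>
  let rest = src;

  while (rest.length > 0) {
    let m;
    if (!inside && (m = openTag.exec(rest))) {
      // Keep only the shortcode and drop the rest of the opening tag.
      inside = true;
      out += m[1];
      rest = rest.slice(m[0].length);
    } else if (inside && (m = closeTag.exec(rest))) {
      // Drop the closing tag of a shortcode link.
      inside = false;
      rest = rest.slice(m[0].length);
    } else {
      // Plain text or other markup: copy it unless we are inside a shortcode link.
      if (!inside) out += rest[0];
      rest = rest.slice(1);
    }
  }
  return out;
}

const mixed =
  'before <a href="http://x/">keep</a> ' +
  '<a title="t" href="[nggallery id=7]" target="_blank"><span>drop</span></a> after';

console.log(stripGalleryLinks(mixed));
// before <a href="http://x/">keep</a> [nggallery id=7] after
```

Because only opening tags that contain a shortcode switch the flag on, ordinary links pass through unchanged.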

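Answer 3 (score: 1)

/<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/i

This captures the shortcode in group 1; then replace the match with the contents of group 1.

Note: this assumes reasonably well-formed HTML; the usual disclaimers apply.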
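A minimal JavaScript sketch of that replacement follows. The g and s flags are assumptions added here so that every link is replaced and links spanning several lines also match; the helper name replaceWithGroupOne is hypothetical.

```javascript
// Pattern from the answer; the g and s flags are assumptions added for this sketch.
const hrefShortcodePattern =
  /<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/gis;

// Hypothetical helper name.
function replaceWithGroupOne(html) {
  return html.replace(hrefShortcodePattern, "$1");
}

// The i flag tolerates upper-case tags, and \s* tolerates spaces around "=".
console.log(replaceWithGroupOne('<A HREF = "[nggallery id=12]" class="x">Go</A>'));
// [nggallery id=12]
```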

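Answer 4 (score: 1)

Late to this question, but I thought I'd throw this in. (Heed the warning! This could get ugly...)

Edit: for use in BBEdit.
Note - BBEdit uses the PCRE engine. BBEdit regex constructs can be found here: https://gist.github.com/ccstone/5385334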

Formatted:

 # (?s)(<a(?=\s)(?>(?:(?<=\s)href\s*=\s*"\s*(\[nggallery\s+id\s*=\s*[^"\]>]*?\])"|".*?"|'.*?'|[^>]*?)+>)(?<!/>)(?(2)|(?!))).*?</a\s*>

 (?s)
 (                             # (1 start), Capture open a tag
      <a                            # Open a tag
      (?= \s )
      (?>                           # Atomic
           (?:
                (?<= \s )
                href \s* = \s*                # href attribute
                "
                \s* 
                (                             # (2 start), Capture shortcode value
                     \[nggallery \s+ 
                     id \s* = \s* [^"\]>]*? 
                     \]
                )                             # (2 end)
                "
             |  " .*? "
             |  ' .*? '
             |  [^>]*? 
           )+
           >
      )
      (?<! /> )                     # Not a self contained closure
      (?(2)                         # Only a tags with href attr, shortcode value
        |  (?!)
      )
 )                             # (1 end)
 .*?                           # Stuff inbetween
 </a \s* >                     # Close a tag

Output:

 **  Grp 0 -  ( pos 0 , len 240 ) 
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>  
 **  Grp 1 -  ( pos 0 , len 28 ) 
<a href="[nggallery id=xx]">  
 **  Grp 2 -  ( pos 9 , len 17 ) 
[nggallery id=xx]  
----------------
 **  Grp 0 -  ( pos 244 , len 46 ) 
<a href="[nggallery id=xxxxx]">Click here!</a>  
 **  Grp 1 -  ( pos 244 , len 31 ) 
<a href="[nggallery id=xxxxx]">  
 **  Grp 2 -  ( pos 253 , len 20 ) 
[nggallery id=xxxxx]  
-----------------
 **  Grp 0 -  ( pos 294 , len 90 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>  
 **  Grp 1 -  ( pos 294 , len 65 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">  
 **  Grp 2 -  ( pos 323 , len 18 ) 
[nggallery id=xxx]  

Answer 5 (score: 1)

Here is a regex that matches your examples perfectly.

(<a.*?href=")|([^\]]*?<\/a>)

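Instead of trying to match the whole expression at once, I used the OR operator to specify two separate regexes: one for the start of the a tag, <a.*?href=", and one for the closing part, [^\]]*?<\/a>. This may or may not work in a single replace operation; if it doesn't, split it into two replace operations, running the closing-tag regex first and then the opening-tag regex. If you have any other examples that break this answer, let me know.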
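The two-sided approach can be tried in JavaScript with a single replace call, as in this sketch. The helper name stripLinkEdges and the second input are hypothetical, and the second call shows one case where the pattern removes more than intended.

```javascript
// Pattern from the answer; replacing with "" deletes both halves of the link markup.
const linkEdgesPattern = /(<a.*?href=")|([^\]]*?<\/a>)/g;

// Hypothetical helper name.
function stripLinkEdges(html) {
  return html.replace(linkEdgesPattern, "");
}

console.log(stripLinkEdges('<a href="[nggallery id=xxxxx]">Click here!</a>'));
// [nggallery id=xxxxx]

// Caveat: a link without a shortcode has no "]" to stop the second alternative,
// so the whole link, URL included, is removed.
console.log(JSON.stringify(stripLinkEdges('<a href="http://example.com/">Home</a>')));
// ""
```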

Answer 6 (score: 1)

I don't know why you would do this with regex when it can be done with JavaScript DOM manipulation.

I'll show you the basic approach to give you an idea:

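var div = document.createElement('div');
div.innerHTML = yourString;  // parse the markup into a detached element
var a = div.querySelector('a');
// read href by name; attributes[0] is the href only when it comes first in the tag
document.body.innerHTML = a.getAttribute('href');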

Working Fiddle

Also take a look at documentFragment.

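Answer 7 (score: 1)

Since you didn't specify, I'm assuming there are no nested anchor tags and that you only want to extract the square-bracket code inside them. I'm also assuming that the identifying format of your code is "[nggallery".

Use this

Find: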
<\s*a(?=\s|>)[^>]*?(\[nggallery[^\]]+\])[^>]*>(.|\n)+?(<\s*\/\s*a\s*>)

Replace with:
\1

(This should be BBEdit's notation for the first capture group.)
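For comparison outside BBEdit, this JavaScript sketch applies the same find pattern to the multi-line link from the question. JavaScript writes the first capture group as $1 in the replacement string; the g flag and the variable names are assumptions added for the sketch.

```javascript
// Pattern from the answer; the g flag is an assumption added for this sketch.
// (.|\n)+? lets the match cross line breaks without needing the s flag.
const tolerantPattern =
  /<\s*a(?=\s|>)[^>]*?(\[nggallery[^\]]+\])[^>]*>(.|\n)+?(<\s*\/\s*a\s*>)/g;

// Multi-line link taken from the question.
const multiLineLink = [
  '<a href="[nggallery id=xx]"><span class="shutterset">',
  '<img class="alignnone size-large wp-image-23067" title="Image Title" ',
  'src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"',
  'alt="" width="685" height="456" /></span></a>',
].join("\n");

console.log(multiLineLink.replace(tolerantPattern, "$1"));
// [nggallery id=xx]
```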

Answer 8 (score: 0)

How about this?

(?<=nggallery\sid=xx]">).*(?=<\/a>)

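Use global and single-line as modifiers (g and s). This matches everything between <a href="[nggallery id=xx]"> and </a>. I'm not sure I understood your question correctly... but this RegEx does what I just described.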
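One thing to watch with this pattern is the greedy .* combined with the s flag. The JavaScript sketch below compares it with a lazy variant; the lazy pattern, the input string, and the variable names are assumptions added for illustration, and the id is the literal "xx" that the answer hard-codes.

```javascript
// Pattern from the answer with the g and s flags it asks for.
const greedyInner = /(?<=nggallery\sid=xx]">).*(?=<\/a>)/gs;
// Assumption for comparison: the same pattern with a lazy quantifier (.*?).
const lazyInner = /(?<=nggallery\sid=xx]">).*?(?=<\/a>)/gs;

// Hypothetical input with two links that share the literal id "xx".
const twoLinks =
  '<a href="[nggallery id=xx]">A</a> and <a href="[nggallery id=xx]">B</a>';

const greedyMatches = twoLinks.match(greedyInner);
const lazyMatches = twoLinks.match(lazyInner);

// Greedy: a single match that runs from the first link to the last </a>.
console.log(JSON.stringify(greedyMatches));
// ["A</a> and <a href=\"[nggallery id=xx]\">B"]

// Lazy: one match per link.
console.log(JSON.stringify(lazyMatches));
// ["A","B"]
```

Either way the pattern matches only the link content, so replacing the matches with an empty string leaves the <a href="[nggallery id=xx]"></a> wrapper in place.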