Trying to figure out a Regular Expression gives me a brain cramp :)

I'm replacing thousands of individual href links, each with its own shortcode, in WordPress post content, using a plugin that lets me run regular expressions on the content.

Rather than try to combine an SQL query with a RegEx, I'm doing it in two stages: first, an SQL find/replace swaps each individual URL for its shortcode; second, a RegEx removes the rest of the 'href' link markup.

These are some examples of what I have now from the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.

<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>

Now, I need to delete all the href link markup - span, img, etc - in between the leading <a and ending </a>, leaving just the shortcode [nggallery id=xxx].

I've got a start here:

But I don't know how to prevent the [nggallery id=xxx] shortcode from being captured in the RegEx.

Update 7/09/2015

@nhahtdh's answer appears to work perfectly, is not too greedy, and doesn't eat adjacent html links. Use ( and ) as delimiters and $1 as a replacement with a regex plugin in WordPress. (If using BBEdit, you will need to use \1)

( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )
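For anyone who wants to check the replacement outside the plugin, here is a minimal JavaScript sketch of the same pattern. The `sample` string is built from the examples in the question, and the `g` and `s` flags are my assumptions (the `s` flag lets `.` cross the line breaks in the first example); the plugin may apply its own modifiers.

```javascript
// Sample content, copied from the examples in the question (assumed input).
const sample = [
  '<a href="[nggallery id=xx]"><span class="shutterset">',
  '<img class="alignnone size-large wp-image-23067" title="Image Title" ',
  'alt="" width="685" height="456" /></span></a>',
  '',
  '<a href="[nggallery id=xxxxx]">Click here!</a>',
  '',
  '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>',
].join('\n');

// [^>]* keeps the search for the shortcode inside a single opening tag,
// and the lazy .*? stops at the first </a> that follows.
const linkPattern = /<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>/gs;

// $1 is the captured shortcode, so each whole link collapses to it.
const cleaned = sample.replace(linkPattern, '$1');
console.log(cleaned);
```

With this sample, only the three shortcodes should remain, separated by the original blank lines.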

Update 7/02/2015

Thanks to Fab Sa (answer below), his regex at


works in the regex101 emulator, but when used in the BBEdit text editor or the WordPress plugin that runs regex, his regex deletes the [nggallery id=***] shortcode. So is it too greedy? Some other issue?

Update 7/01/2015:

I know, I know, re: RegEx match open tags except XHTML self-contained tags YOU CANNOT PARSE HTML WITH REGEX

Use the global flag (g). This regex matches the link and captures the [nggallery ...] part. You can replace every match with $1 to keep the captured [nggallery ...] part.


PS: In this solution, [nggallery ...] does not have to sit inside a specific attribute such as href. If you want to enforce that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>

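Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> swallows everything when there are multiple <a> tags on a single line, because the leading .* is unbounded and will run across different <a> tags.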
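A small JavaScript sketch makes the difference visible. The one-line `line` input is a made-up example, and `greedy` / `bounded` are just illustrative names for the two patterns quoted in this thread.

```javascript
// Hypothetical input: two shortcode links on the same line.
const line = '<a href="[nggallery id=1]">A</a> and <a href="[nggallery id=2]">B</a>';

// The unbounded leading .* runs to the end of the line, then backtracks
// only as far as the LAST shortcode, so one match spans both links.
const greedy = /<a.*(\[nggallery[^\]+]*\]).*?<\/a>/g;

// [^>]* cannot leave the opening tag, so each link is matched on its own.
const bounded = /<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>/g;

const greedyResult = line.replace(greedy, '$1');
const boundedResult = line.replace(bounded, '$1');
console.log(greedyResult);
console.log(boundedResult);
```

On this input the greedy pattern leaves only `[nggallery id=2]` (the first shortcode and the text between the links are lost), while the bounded pattern leaves `[nggallery id=1] and [nggallery id=2]`.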





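Yes, you can't parse HTML with regular expressions, so how about making this bulletproof with a minimal lexer-parser? It gives you more flexible control over the code.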


$src = <<<EOF
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
EOF;

// we "eat up" the source string by opening <a> tags, closing <a> tags or text
$tokens = array();
while ($src){
    // check if $src begins with this pattern <a (any optional prop)[nggallery (any string)] (any optional prop)>
    if (preg_match('/^<a [^>]*(\[nggallery [^\]]+\])[^>]*>/s', $src, $match)){
        // here you can handle data with more flexibility
        // you can grab the id or the [placeholder] via 
        //$match[1] = [nggallery id=xyz]

        // we store the chunk of string and label it as an opening tag
        $tokens[] = array('type' => 'OPENING_A', 'value' => $match[0]);
    }else if (preg_match('/^<\/a>/s', $src, $match)){
        // we store the chunk of string and label it as a closing tag
        $tokens[] = array('type' => 'CLOSING_A', 'value' => $match[0]);
    }else if (preg_match('/^./s', $src, $match)){
        // we store the chunk of string, in this case a character and label it as text
        $tokens[] = array('type' => 'TEXT', 'value' => $match[0]);
    }
    // finally we remove the identified pattern from the source string
    // and continue "eating it up"
    $src = substr($src, strlen($match[0]));
}

// once the source string has been consumed, we get this array
// var_dump($tokens);
// array (size=247)
//   0 => 
//     array (size=2)
//       'type' => string 'OPENING_A' (length=9)
//       'value' => string '<a href="[nggallery id=xx]">' (length=28)
//   1 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string '<' (length=1)
//   2 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 's' (length=1)
//   3 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 'p' (length=1)
//       ... omitted for brevity

// now with all the parsed data, we can rebuild the html
// as needed
$html = '';
// we keep a flag to know if we are inside a tag
// marked with nggallery
$insideNGGalleryTag = false;

foreach ($tokens as $token){
    if ($token['type'] == 'OPENING_A'){
        $insideNGGalleryTag = true;
        $html .= $token['value'];
    }else if ($token['type'] == 'CLOSING_A'){
        $insideNGGalleryTag = false;
        $html .= $token['value'];
    }else if ($token['type'] == 'TEXT'){
        // if we are inside a nggallery tag, we will ignore
        // all text inside it. here you could also remove
        // html properties from the tag, move the [nggallery placeholder]
        // inside the <a> or some other behavior you might need
        if (!$insideNGGalleryTag){
            $html .= $token['value'];
        }
    }
}

// finally echo or write to file the
// modified html, in this case it would return
// <a href="[nggallery id=xx]"></a>
// <a href="[nggallery id=xxxxx]"></a>
// <a title="title title" href="[nggallery id=xxx]" target="_blank"></a>
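The same token loop can also emit just the shortcode, which is what the question asks for. Below is a rough JavaScript sketch of that variant, not a port of the exact PHP above: the function name `linksToShortcodes` and the `input` string are hypothetical, and closing tags of ordinary links are deliberately left alone.

```javascript
// Hypothetical helper: walks the markup like the PHP loop above, but
// writes only the shortcode for each link that carries one.
function linksToShortcodes(src) {
  const opening = /^<a [^>]*(\[nggallery [^\]]+\])[^>]*>/;
  const closing = /^<\/a>/;
  let out = '';
  let inside = false; // true between a shortcode <a ...> and its </a>
  while (src.length > 0) {
    let m;
    if ((m = opening.exec(src))) {
      inside = true;
      out += m[1]; // keep only the captured shortcode
    } else if (inside && (m = closing.exec(src))) {
      inside = false; // drop the closing tag of a shortcode link
    } else {
      m = [src[0]]; // anything else is consumed one character at a time
      if (!inside) out += m[0]; // text inside a shortcode link is skipped
    }
    src = src.slice(m[0].length); // "eat up" what was just recognised
  }
  return out;
}

// Made-up input: one shortcode link followed by an ordinary link.
const input =
  '<a href="[nggallery id=xx]"><span class="shutterset">\n' +
  '<img alt="" /></span></a>\n' +
  '<a href="http://example.com">keep</a>';
const output = linksToShortcodes(input);
console.log(output);
```

The ordinary link passes through untouched because its opening tag never matches `opening`, so `inside` stays false while it is being read.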

The regex /<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/i puts the shortcode into group 1; then replace each match with the contents of group 1.


I'm late to this question, but I thought I'd throw this in (Warning!! This could get ugly...)

Note - BBEdit uses the PCRE engine. The BBEdit regex constructs can be found here:


 # (?s)(<a(?=\s)(?>(?:(?<=\s)href\s*=\s*"\s*(\[nggallery\s+id\s*=\s*[^"\]>]*?\])"|".*?"|'.*?'|[^>]*?)+>)(?<!/>)(?(2)|(?!))).*?</a\s*>

 (                             # (1 start), Capture open a tag
      <a                            # Open a tag
      (?= \s )
      (?>                           # Atomic
           (?:
                (?<= \s )
                href \s* = \s*                # href attribute
                " \s* 
                (                             # (2 start), Capture shortcode value
                     \[nggallery \s+ 
                     id \s* = \s* [^"\]>]*? 
                     \]
                )                             # (2 end)
                "
             |  " .*? "
             |  ' .*? '
             |  [^>]*? 
           )+
           >
      )
      (?<! /> )                     # Not a self contained closure
      (?(2)                         # Only a tags with href attr, shortcode value
        |  (?!)
      )
 )                             # (1 end)
 .*?                           # Stuff inbetween
 </a \s* >                     # Close a tag


 **  Grp 0 -  ( pos 0 , len 240 ) 
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
alt="" width="685" height="456" /></span></a>  
 **  Grp 1 -  ( pos 0 , len 28 ) 
<a href="[nggallery id=xx]">  
 **  Grp 2 -  ( pos 9 , len 17 ) 
[nggallery id=xx]  
 **  Grp 0 -  ( pos 244 , len 46 ) 
<a href="[nggallery id=xxxxx]">Click here!</a>  
 **  Grp 1 -  ( pos 244 , len 31 ) 
<a href="[nggallery id=xxxxx]">  
 **  Grp 2 -  ( pos 253 , len 20 ) 
[nggallery id=xxxxx]  
 **  Grp 0 -  ( pos 294 , len 90 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>  
 **  Grp 1 -  ( pos 294 , len 65 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">  
 **  Grp 2 -  ( pos 323 , len 18 ) 
[nggallery id=xxx]  

I don't know why you would do this with a regular expression when it can be done with JavaScript DOM manipulation.


var div = document.createElement('div');
div.innerHTML = yourString;
var a = div.querySelector('a');
// read href by name: attributes[0] is only the href when href comes first in the tag
document.body.innerHTML = a.getAttribute('href');

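Use global and single-line as modifiers (-g and -s). This matches everything between <a href="[nggallery id=xx]"> and </a>. I'm not sure I understood your question correctly... but this RegEx does what I just described.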