Trying to figure out a Regular Expression gives me a brain cramp :)
I'm replacing thousands of individual href links with individual shortcodes in WordPress post content, using a plugin that lets me run regular expressions on the content.
Rather than try to combine an SQL query with a RegEx, I'm doing it in two stages: first, SQL to find and replace each individual URL with its individual shortcode; second, a RegEx to remove the rest of the href link markup.
These are some examples of what I have now after the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title"
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>
<a href="[nggallery id=xxxxx]">Click here!</a>
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
Now I need to delete all the href link markup (span, img, etc.) between the leading <a and the ending </a>, leaving just the shortcode [nggallery id=xxx].
I've got a start here: https://www.regex101.com/r/rL8wP1/2
But I don't know how to prevent the [nggallery id=xxx] shortcode from being captured in the RegEx.
Update 7/09/2015
@nhahtdh's answer appears to work perfectly, is not too greedy, and doesn't eat adjacent HTML links. Use ( and ) as delimiters and $1 as the replacement with a regex plugin in WordPress. (If using BBEdit, you will need to use \1 instead.)
( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )
Update 7/02/2015
Thanks to Fab Sa (answer below), his regex at https://www.regex101.com/r/rL8wP1/4
<a.*(\[nggallery[^\]+]*\]).*?<\/a>
works in the regex101 emulator, but when used in the BBEdit text editor or the WordPress plugin that runs regex, it deletes the [nggallery id=***] shortcode. So is it too greedy? Some other issue?
Update 7/01/2015:
I know, I know, re: "RegEx match open tags except XHTML self-contained tags": YOU CANNOT PARSE HTML WITH REGEX.
Answer 0 (score: 7)
You can use this regex
<a.*(\[nggallery[^\]+]*\]).*?<\/a>
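globally (flag g). This regex matches the whole link and captures the [nggallery ...] part. Replace every match with $1 to keep only the captured [nggallery ...] part.
I've updated your regex online: https://www.regex101.com/r/rL8wP1/4
PS: In this solution, [nggallery ...] does not need to be inside a specific attribute such as href. If you want to enforce that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>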
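As a rough illustration of the "replace every match with $1" step, here is a minimal Python sketch. It only approximates what the WordPress plugin or BBEdit does: Python's re module writes the backreference as \1 and replaces every match by default, so no g flag is needed. The helper names strip_links and strip_links_href_only are made up for this example.

```python
import re

# The two patterns from this answer, copied as-is.
ANY_ATTRIBUTE = re.compile(r'<a.*(\[nggallery[^\]+]*\]).*?<\/a>')
HREF_ONLY = re.compile(r'<a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>')


def strip_links(text):
    """Replace each matched link with its captured shortcode (hypothetical helper)."""
    return ANY_ATTRIBUTE.sub(r'\1', text)


def strip_links_href_only(text):
    """Same idea, but only when the shortcode sits inside href (hypothetical helper)."""
    return HREF_ONLY.sub(r'\1', text)


if __name__ == '__main__':
    print(strip_links('<a href="[nggallery id=xxxxx]">Click here!</a>'))
    # A shortcode in some other attribute is only picked up by the first pattern.
    odd = '<a title="[nggallery id=1]" href="http://example.com">x</a>'
    print(strip_links(odd))
    print(strip_links_href_only(odd))
```

In this sketch . does not cross line breaks, so a link spread over several lines (like the first example in the question) would not match unless re.DOTALL is added.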
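Answer 1 (score: 7)
Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> will eat everything when there are multiple <a> tags on a single line, because the .* at the beginning is unrestricted and will match across different <a> tags.
By restricting the allowed characters, you can more or less match what you want: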
<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>
  ^^^^^^^
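I force at least one whitespace character after a to make sure it doesn't match other tags, plus some extra restrictions.
Anyway, if you find that it doesn't work in some corner cases, you are on your own. Manipulating HTML with regex is generally a bad idea.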
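To make the difference concrete, here is a small Python sketch that runs both patterns over one line containing two links. The input line is invented for the example, and Python's re module stands in for the PCRE-style engines discussed in the question, so treat it as an approximation.

```python
import re

# Fab Sa's pattern: the leading .* is unrestricted.
GREEDY = re.compile(r'<a.*(\[nggallery[^\]+]*\]).*?<\/a>')
# The restricted pattern from this answer: [^>]* cannot leave the opening tag.
RESTRICTED = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>')

# Hypothetical input: two links on a single line.
line = '<a href="[nggallery id=1]">One</a> and <a href="[nggallery id=2]">Two</a>'

greedy_result = GREEDY.sub(r'\1', line)
restricted_result = RESTRICTED.sub(r'\1', line)

print(greedy_result)      # the first link and the text in between are swallowed
print(restricted_result)  # each link is replaced by its own shortcode
```

With the unrestricted pattern, .* runs to the end of the line and backtracks only as far as the last shortcode, so the whole line becomes one match.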
Answer 2 (score: 5)
Yes, you can't parse HTML with a regex, so how about making this bulletproof with a minimalistic lexer-parser? It gives you more flexible control in code.
<?php
$src = <<<EOF
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title"
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>
<a href="[nggallery id=xxxxx]">Click here!</a>
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
EOF;
// we "eat up" the source string by opening <a> tags, closing <a> tags or text
$tokens = array();
// strlen() instead of a bare truthiness test, so a remaining "0" is not treated as empty
while (strlen($src) > 0){
    // check if $src begins with this pattern <a (any optional prop)[nggallery (any string)] (any optional prop)>
    if (preg_match('/^<a [^>]*(\[nggallery [^\]]+\])[^>]*>/s', $src, $match)){
        // here you can handle data with more flexibility
        // you can grab the id or the [placeholder] via
        // $match[1] = [nggallery id=xyz]
        // we store the chunk of string and label it as an opening tag
        $tokens[] = array('type' => 'OPENING_A', 'value' => $match[0]);
    }else if (preg_match('/^<\/a>/s', $src, $match)){
        // we store the chunk of string and label it as a closing tag
        $tokens[] = array('type' => 'CLOSING_A', 'value' => $match[0]);
    }else if (preg_match('/^./s', $src, $match)){
        // we store the chunk of string, in this case a character and label it as text
        $tokens[] = array('type' => 'TEXT', 'value' => $match[0]);
    }
    // finally we remove the identified pattern from the source string
    // and continue "eating it up"
    $src = substr($src, strlen($match[0]));
}
// once the source string has been consumed, we get this array
// var_dump($tokens);
// array (size=247)
// 0 =>
// array (size=2)
// 'type' => string 'OPENING_A' (length=9)
// 'value' => string '<a href="[nggallery id=xx]">' (length=28)
// 1 =>
// array (size=2)
// 'type' => string 'TEXT' (length=4)
// 'value' => string '<' (length=1)
// 2 =>
// array (size=2)
// 'type' => string 'TEXT' (length=4)
// 'value' => string 's' (length=1)
// 3 =>
// array (size=2)
// 'type' => string 'TEXT' (length=4)
// 'value' => string 'p' (length=1)
// ... omitted for brevity
// now with all the parsed data, we can rebuild the html
// as needed
$html = '';
// we keep a flag to know if we are inside a tag
// marked with nggallery
$insideNGGalleryTag = false;
foreach ($tokens as $token){
    if ($token['type'] == 'OPENING_A'){
        $insideNGGalleryTag = true;
        $html .= $token['value'];
    }else if ($token['type'] == 'CLOSING_A'){
        $insideNGGalleryTag = false;
        $html .= $token['value'];
    }else{
        // if we are inside a nggallery tag, we will ignore
        // all text inside it. here you could also remove
        // html properties from the tag, move the [nggallery placeholder]
        // inside the <a> or some other behavior you might need
        if (!$insideNGGalleryTag){
            $html .= $token['value'];
        }
    }
}
// finally echo or write to file the
// modified html, in this case it would return
var_dump($html);
// <a href="[nggallery id=xx]"></a>
// <a href="[nggallery id=xxxxx]"></a>
// <a title="title title" href="[nggallery id=xxx]" target="_blank"></a>
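Answer 3 (score: 1)
/<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/i
This puts the shortcode into group 1; then replace the match with the contents of group 1.
Note: this assumes the HTML is reasonably well-formed; the usual disclaimers apply.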
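Here is a minimal Python sketch of that find-and-replace, run against the three examples from the question. The re.S flag is an assumption added here so that .*? can cross the line breaks in the first example; the pattern in the answer only carries the i flag. The name replace_with_shortcode is invented for the example.

```python
import re

# Pattern from this answer; re.I mirrors the /i flag, re.S is an added assumption.
PATTERN = re.compile(
    r'<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>',
    re.I | re.S,
)

# The three examples from the question.
SAMPLE = (
    '<a href="[nggallery id=xx]"><span class="shutterset">\n'
    '<img class="alignnone size-large wp-image-23067" title="Image Title"\n'
    'src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"\n'
    'alt="" width="685" height="456" /></span></a>\n'
    '<a href="[nggallery id=xxxxx]">Click here!</a>\n'
    '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>'
)


def replace_with_shortcode(text):
    """Replace each matching link with group 1 (hypothetical helper)."""
    return PATTERN.sub(r'\1', text)


if __name__ == '__main__':
    print(replace_with_shortcode(SAMPLE))
```

Because [^>]* cannot leave the opening tag, a link whose href holds an ordinary URL is left untouched.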
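Answer 4 (score: 1)
This is late to the question, but I thought I'd throw it in. (Heed the warning!! This could get ugly...)
Edit: for use with BBEdit.
Note: BBEdit uses the PCRE engine. BBEdit's regex constructs can be found here: https://gist.github.com/ccstone/5385334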
# (?s)(<a(?=\s)(?>(?:(?<=\s)href\s*=\s*"\s*(\[nggallery\s+id\s*=\s*[^"\]>]*?\])"|".*?"|'.*?'|[^>]*?)+>)(?<!/>)(?(2)|(?!))).*?</a\s*>
(?s)
( # (1 start), Capture open a tag
<a # Open a tag
(?= \s )
(?> # Atomic
(?:
(?<= \s )
href \s* = \s* # href attribute
"
\s*
( # (2 start), Capture shortcode value
\[nggallery \s+
id \s* = \s* [^"\]>]*?
\]
) # (2 end)
"
| " .*? "
| ' .*? '
| [^>]*?
)+
>
)
(?<! /> ) # Not a self contained closure
(?(2) # Only a tags with href attr, shortcode value
| (?!)
)
) # (1 end)
.*? # Stuff inbetween
</a \s* > # Close a tag
Output:
** Grp 0 - ( pos 0 , len 240 )
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title"
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>
** Grp 1 - ( pos 0 , len 28 )
<a href="[nggallery id=xx]">
** Grp 2 - ( pos 9 , len 17 )
[nggallery id=xx]
----------------
** Grp 0 - ( pos 244 , len 46 )
<a href="[nggallery id=xxxxx]">Click here!</a>
** Grp 1 - ( pos 244 , len 31 )
<a href="[nggallery id=xxxxx]">
** Grp 2 - ( pos 253 , len 20 )
[nggallery id=xxxxx]
-----------------
** Grp 0 - ( pos 294 , len 90 )
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
** Grp 1 - ( pos 294 , len 65 )
<a title="title title" href="[nggallery id=xxx]" target="_blank">
** Grp 2 - ( pos 323 , len 18 )
[nggallery id=xxx]
Answer 5 (score: 1)
Here is a regex that works perfectly with your examples.
(<a.*?href=")|([^\]]*?<\/a>)
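Instead of trying to match the whole expression at once, I used the OR operator to specify two separate regexes: one for the start of the a tag, <a.*?href=", and another for the closing part, [^\]]*?<\/a>. This may or may not work in a single replace operation; if it doesn't, split it into two replace operations, first running one for the closing-tag regex and then one for the opening tag. If you have any other examples that break this answer, let me know.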
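A short Python sketch of the single-pass variant, deleting every match (empty replacement) from the question's three examples. Python's re module is used here as a stand-in for the plugin or BBEdit, and delete_link_markup is a made-up name.

```python
import re

# The alternation from this answer: opening part OR closing part of the link.
PATTERN = re.compile(r'(<a.*?href=")|([^\]]*?<\/a>)')

# The three examples from the question.
SAMPLE = (
    '<a href="[nggallery id=xx]"><span class="shutterset">\n'
    '<img class="alignnone size-large wp-image-23067" title="Image Title"\n'
    'src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"\n'
    'alt="" width="685" height="456" /></span></a>\n'
    '<a href="[nggallery id=xxxxx]">Click here!</a>\n'
    '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>'
)


def delete_link_markup(text):
    """Delete everything the pattern matches, leaving the shortcodes (hypothetical helper)."""
    return PATTERN.sub('', text)


if __name__ == '__main__':
    print(delete_link_markup(SAMPLE))
```

One thing to watch for: the closing-part alternative matches any run of characters without ] that ends in </a>, so in this sketch an ordinary link such as See <a href="http://example.com">docs</a> is deleted entirely, together with the text in front of it. It seems safest to run it only on content where every link already holds a shortcode.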
Answer 6 (score: 1)
I don't see why you would do this with a regex when it can be done with JavaScript DOM manipulation.
I'll show you the basic approach, to give you an idea:
var div = document.createElement('div');
div.innerHTML = yourString;
var a = div.querySelector('a');
// read href by name; attributes[0] is only the href when href happens to come first
document.body.innerHTML = a.getAttribute('href');
Working Fiddle
Also check out documentFragment.
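The same parser-based idea can be sketched outside the browser. The example below swaps the DOM for Python's standard html.parser module (a different tool from the one in this answer) and simply collects the href of every link whose href starts with [nggallery. The class name ShortcodeCollector is made up for the example.

```python
from html.parser import HTMLParser


class ShortcodeCollector(HTMLParser):
    """Collects href values that hold an nggallery shortcode (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.shortcodes = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # Look the attribute up by name, so its position in the tag does not matter.
        href = dict(attrs).get('href') or ''
        if href.startswith('[nggallery'):
            self.shortcodes.append(href)


# Third example from the question: title comes before href.
html = ('<a title="title title" href="[nggallery id=xxx]" target="_blank">'
        'Title Link Title Link</a>')

collector = ShortcodeCollector()
collector.feed(html)
print(collector.shortcodes)
```

This only extracts the shortcodes; writing them back into the post content in place of the links would need extra code.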
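Answer 7 (score: 1)
Since you didn't specify, I'm assuming there are no nested anchor tags and that you just want to extract the square-bracketed code inside them. I'm also assuming that the identifying prefix of your code is "[nggallery".
Find with:
<\s*a(?=\s|>)[^>]*?(\[nggallery[^\]]+\])[^>]*>(.|\n)+?(<\s*\/\s*a\s*>)
Replace with:
\1
(this should be BBEdit's notation for the first capture group)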
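A quick Python sketch of this find/replace pair, using the question's three examples plus one invented link with stray whitespace inside the tags. Python also writes the first capture group as \1 in the replacement; find_and_replace is a made-up name.

```python
import re

# The "find" pattern from this answer, unchanged.
FIND = re.compile(
    r'<\s*a(?=\s|>)[^>]*?(\[nggallery[^\]]+\])[^>]*>(.|\n)+?(<\s*\/\s*a\s*>)'
)

# The three examples from the question.
SAMPLE = (
    '<a href="[nggallery id=xx]"><span class="shutterset">\n'
    '<img class="alignnone size-large wp-image-23067" title="Image Title"\n'
    'src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"\n'
    'alt="" width="685" height="456" /></span></a>\n'
    '<a href="[nggallery id=xxxxx]">Click here!</a>\n'
    '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>'
)


def find_and_replace(text):
    """Replace each matched link with capture group 1 (hypothetical helper)."""
    return FIND.sub(r'\1', text)


if __name__ == '__main__':
    print(find_and_replace(SAMPLE))
    # Hypothetical input: whitespace inside the tags is tolerated by the \s* parts.
    print(find_and_replace('< a href="[nggallery id=9]" >text< / a >'))
```

The (.|\n) group is what lets the match run across line breaks without a single-line flag.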
Answer 8 (score: 0)
How about this?
(?<=nggallery\sid=xx]">).*(?=<\/a>)
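Use global and single-line as modifiers (-g and -s). This matches everything between <a href="[nggallery id=xx]"> and </a>. I'm not sure I understood your question correctly... but this RegEx does what I just described.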
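For reference, here is a small Python sketch of what this pattern removes. re.S plays the role of the single-line modifier, and the inputs are shortened, made-up variants of the question's examples.

```python
import re

# Pattern from this answer; the id "xx" is hard-coded in the lookbehind.
INNER = re.compile(r'(?<=nggallery\sid=xx]">).*(?=<\/a>)', re.S)

# Hypothetical, shortened version of the first example from the question.
one_link = ('<a href="[nggallery id=xx]"><span class="shutterset">\n'
            '<img alt="" /></span></a>')
one_link_result = INNER.sub('', one_link)
print(one_link_result)

# Same text followed by a second link.
two_links = one_link + '\n<a href="[nggallery id=xxxxx]">Click here!</a>'
two_links_result = INNER.sub('', two_links)
print(two_links_result)
```

In this sketch the surrounding <a ...> and </a> stay in place, and with more than one link in the text the greedy .* runs on to the last </a>, so the later link disappears as well. Making it lazy (.*?) would limit each match to one link, but the lookbehind would still only fit the id xx.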