I'm trying to parse some HTML and remove unnecessary duplicate links. For example, I'd like the following code:
<p>
Lorem ipsum amet
<a href="http://edition.cnn.com/">
Proin lacinia posuere
</a>
sit ipsum.
</p>
<p>
<a href="http://www.google.com/articles/blah">
[caption align="alignright"]
<a href="http://www.google.com/articles/blah">
<img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
</a>
sociis magnis [/caption]
</a>
</p>
to be converted to this (removing the link before [caption], along with its closing tag):
<p>
Lorem ipsum amet
<a href="http://edition.cnn.com/">
Proin lacinia posuere
</a>
sit ipsum.
</p>
<p>
[caption align="alignright"]
<a href="http://www.google.com/articles/blah">
<img src="http://hoohlr.dev/Picture-142-300x222.png" alt="Blah blah/Flickr " height="222" class="size-medium wp-image-4351" />
</a>
sociis magnis [/caption]
</p>
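The link to remove will always come right before [caption]. Could anyone who is good with regular expressions help me do this with PHP's preg_replace (or something simpler)?
I'd really appreciate it. Thanks!
Edit: OK, I've made a decent attempt at what I'm looking for: http://regexr.com?31t05 and http://regexr.com?31svv. The site won't let me post it as an answer... Can anyone improve on it?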
Answer 0 (score: 0)
This test script works with your test data:
<?php // test.php Rev:20120820_2200
function stripNestedAnchorTags($text) {
    $re = '%
        # Match (invalid) outer A element containing inner A element.
        <a\b[^<>]+>\s*                 # Outer A element start tag (and ws).
        (                              # $1: contents of outer A element.
          [^<]*(?:<(?!/?a\b)[^<]*)*    #   Everything up to inner <a>
          <a\b[^<>]+>                  #   Inner A element start tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*    #   Everything up to inner </a>
          </a>                         #   Inner A element end tag.
          [^<]*(?:<(?!/?a\b)[^<]*)*    #   Everything up to outer </a>
        )                              # End $1: contents of outer A.
        </a>\s*                        # Outer A element end tag (and ws).
        %ix';
    // Repeat until no nested anchor pairs remain.
    while (preg_match($re, $text)) {
        $text = preg_replace($re, '$1', $text);
    }
    return $text;
}
$idata = file_get_contents('testdata.html');
$odata = stripNestedAnchorTags($idata);
file_put_contents('testdata_out.html', $odata);
?>
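
Since the question says the link to remove always sits directly before [caption], a narrower pattern may be enough. The following is only a sketch under that assumption: the function name stripAnchorAroundCaption and the shortened sample markup are hypothetical and are not part of the script above. It unwraps only an <a> whose start tag is immediately followed by [caption ...] and whose end tag immediately follows [/caption].

```php
<?php
// Hypothetical helper (not part of the answer's script): removes an <a>
// wrapper that directly encloses a [caption]...[/caption] shortcode.
// Assumes captions are not nested inside one another.
function stripAnchorAroundCaption(string $html): string
{
    $re = '%
        <a\b[^<>]*>\s*            # wrapper start tag, then optional whitespace
        (                         # $1: the whole shortcode, kept as-is
          \[caption\b[^\]]*\]     #   opening shortcode tag with its attributes
          .*?                     #   shortcode body (lazy, may span lines)
          \[/caption\]            #   closing shortcode tag
        )
        \s*</a>                   # wrapper end tag right after the shortcode
    %isx';

    $result = preg_replace($re, '$1', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Shortened stand-in for the markup in the question (placeholder image path).
$html = '<p><a href="http://www.google.com/articles/blah">' . "\n"
      . '[caption align="alignright"]' . "\n"
      . '<a href="http://www.google.com/articles/blah"><img src="x.png" /></a>' . "\n"
      . 'sociis magnis [/caption]' . "\n"
      . '</a></p>';

echo stripAnchorAroundCaption($html), "\n";
```

With this input the outer link should disappear while the inner link around the image stays inside the caption. Unlike stripNestedAnchorTags, this sketch leaves other nested anchors untouched, so it only fits if the [caption] case is the only one that needs fixing.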