Question

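I am testing my regex here -> http://www.regexr.com/3ehda

I tried the pattern <a.*>*?<\/a>, but it does not match an anchor that contains a newline, and it also matches the anchors inside figcaption.

Can anyone help me remove all anchor tags except the anchors inside figcaption tags?

If this is hard to do with a regex, maybe someone can give me a hint on how to solve it another way?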

Answer 1

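As you can read just about everywhere, regex is not a reliable way to parse html (too many pitfalls). PHP has classes for parsing, querying, and editing an html string: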

$dom = new DOMDocument;
# keep errors from badly formatted html from being displayed; store them internally instead
libxml_use_internal_errors(true);
# parse the html content wrapped in a root tag with an xml declaration to specify
# the encoding, and build the DOM tree
$dom->loadHTML('<?xml encoding="UTF-8"?><div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);
# clear the html errors
libxml_clear_errors();

$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[not(./ancestor::figcaption)]');

# remove the selected nodes
foreach($nodeList as $node) {
    $node->parentNode->removeChild($node);
}

# build the result string concatenating root child nodes
$result = '';

foreach($dom->documentElement->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}

echo $result;
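
To try the steps above on a concrete input, here is a minimal sketch that wraps them in a function. The function name removeAnchorsOutsideFigcaption and the sample markup are made up for illustration and are not part of the original answer. The sample includes an anchor whose text spans two lines, the case the regex in the question struggled with. Exact whitespace in the output can vary with the libxml version, so no expected output is shown.

```php
<?php
// Hypothetical helper: same approach as above, packaged as a function.
function removeAnchorsOutsideFigcaption(string $html): string
{
    $dom = new DOMDocument;
    // collect parse warnings internally instead of printing them
    libxml_use_internal_errors(true);
    // wrap the fragment in a root <div> and declare the encoding
    $dom->loadHTML(
        '<?xml encoding="UTF-8"?><div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED
    );
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // select every <a> that has no <figcaption> ancestor
    foreach ($xp->query('//a[not(ancestor::figcaption)]') as $node) {
        // removes the anchor together with its content
        $node->parentNode->removeChild($node);
    }

    // serialize the children of the wrapper <div>, not the wrapper itself
    $result = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

// Made-up sample input: one multi-line anchor outside figcaption, one inside.
$html = "<p><a href=\"#x\">drop\nme</a> some text</p>"
      . '<figure><figcaption><a href="#y">keep me</a></figcaption></figure>';

echo removeAnchorsOutsideFigcaption($html);
```

Note that, as in the code above, the anchor is removed along with its text. Only the anchor inside figcaption should remain in the output.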

PHP: remove all anchor tags except the anchors inside figcaption tags
