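I've been working on a way to find certain keywords when they occur inside a 'p', 'span' or 'blockquote' and replace them with links using DOMDocument. I have already written a regular expression that does this, but I would rather use DOMDocument because it should lead to a better solution.
The code below has two main problems. If I put an & in $html it crashes, because the & is not escaped, and I can't find the right way to escape it.
A smaller, less important problem: if the HTML is invalid, DOMDocument tries to correct it, and I can't seem to prevent that.
preg_replace is given arrays because it will eventually be loaded dynamically with multiple keywords.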
$html = '
<blockquote>Random we random text</blockquote>
<p>We like to match text</p>
<p>This is sample text</p>';
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]') as $node)
{
    $replaced = preg_replace(
        array('/(^|\s)'.preg_quote('we', '/').'(\s|$)/msi'),
        array('<a href="#wrapped">we</a>'),
        $node->wholeText
    );
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}
$result = mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
libxml_clear_errors();
echo $result;
Answer 0 (score: 0)
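The problem with the ampersand comes from your use of appendXML($replaced): it injects markup, but nothing escapes the <, > or & characters in the text parts first.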
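To make the failure concrete, here is a minimal sketch of what appendXML() does with an unescaped ampersand. The sample string 'Tom & Jerry' is an assumption chosen for illustration; it is not from the question. The second half only shows that escaping the text first makes the string parse; it is not meant as the recommended fix.

```php
<?php
// Collect parser errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();

// appendXML() parses its argument as XML, so a bare "&" is a syntax error
// and the call reports failure.
$fragment = $dom->createDocumentFragment();
$ok = $fragment->appendXML('Tom & Jerry');
var_dump($ok); // bool(false)

// The same text, escaped for XML first, parses without complaint.
$escaped = htmlspecialchars('Tom & Jerry', ENT_XML1 | ENT_QUOTES);
$fragment2 = $dom->createDocumentFragment();
$ok2 = $fragment2->appendXML($escaped);
var_dump($ok2); // bool(true)

libxml_clear_errors();
```

Escaping by hand works only as long as every text part is escaped and every piece of markup is not, which is exactly the bookkeeping that building nodes through the DOM API avoids.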
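The underlying issue is that, even though you use DOMDocument to avoid regex manipulation, you are still manipulating HTML as a string on a smaller scale, so you run into the same kind of problems.
Here is a way to avoid all of that. I did not keep the array style of replacement, so as not to overcomplicate it. I'm sure you will manage to extend it with other kinds of replacements when you need them: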
foreach ($xpath->query(
        '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]')
        as $node) {
    // Keep a reference to the parent node:
    $parent = $node->parentNode;
    // Split text (e.g. "random we random text") into parts so we
    // can isolate the parts that must be modified.
    // e.g. into: ["random ", "we", " random text"]
    $parts = preg_split('/\b('.preg_quote('we', '/').')\b/msi',
        $node->textContent, 0, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $index => $part) {
        // Skip empty strings only; empty() would also drop the text "0":
        if ($part === '') continue;
        // Parts corresponding with the captured expression in the
        // split delimiter (e.g. "we") occur at odd indexes:
        if ($index % 2) {
            // Create the anchor the DOM way. The matched text goes into
            // a text node, so the library escapes it and it is never
            // interpreted as HTML:
            $el = $dom->createElement('a');
            $el->appendChild($dom->createTextNode($part));
            $el->setAttribute('href', '#wrapped');
        } else {
            // Create the text node the DOM way. The text will be escaped
            // by the library, as it knows it should not be interpreted
            // as HTML:
            $el = $dom->createTextNode($part);
        }
        // Insert this part before the node we are processing:
        $parent->insertBefore($el, $node);
    }
    // When all parts are inserted, delete the node we split:
    $parent->removeChild($node);
}
This way you will not run into the ampersand problem.
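For the multi-keyword case mentioned in the question, the same split-and-rebuild idea can be packaged as a function. This is only a sketch under stated assumptions: the name linkKeywords and its keyword => href array are hypothetical, keywords are assumed to be plain ASCII words, character-encoding handling is left out, and arrow functions require PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not part of any library): wraps whole-word,
// case-insensitive matches of each keyword in an <a> element.
// $links maps keyword => href, e.g. ['we' => '#wrapped'].
function linkKeywords(string $html, array $links): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // assumes ASCII input; encoding is not handled here
    libxml_clear_errors();

    // Lower-cased lookup table, plus one alternation pattern for all keywords.
    $hrefs = array_change_key_case($links, CASE_LOWER);
    $quoted = array_map(fn($k) => preg_quote($k, '/'), array_keys($links));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    $xpath = new DOMXPath($dom);
    $query = '//text()[not(ancestor::a)][ancestor::p or ancestor::blockquote]';
    foreach ($xpath->query($query) as $node) {
        $parent = $node->parentNode;
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        foreach ($parts as $index => $part) {
            if ($part === '') continue;
            if ($index % 2) {
                // Odd indexes hold the captured keyword: build the link node.
                $el = $dom->createElement('a');
                $el->setAttribute('href', $hrefs[strtolower($part)]);
                $el->appendChild($dom->createTextNode($part));
            } else {
                // Even indexes hold ordinary text: keep it as a text node.
                $el = $dom->createTextNode($part);
            }
            $parent->insertBefore($el, $node);
        }
        $parent->removeChild($node);
    }

    // Serialize only the children of <body>, not the implied wrappers.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkKeywords('<p>We like Tom & Jerry</p>', ['we' => '#wrapped']), PHP_EOL;
// prints: <p><a href="#wrapped">We</a> like Tom &amp; Jerry</p>
```

Serializing the children of body one by one also replaces the mb_substr() trimming from the question, since no wrapper tags are produced in the first place.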
Note: I don't know of a way to stop DOMDocument from "fixing" invalid HTML.
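As a partial mitigation only, loadHTML() accepts the libxml option constants LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, which suppress the implied html/body wrappers and the default doctype. They do not stop the parser from repairing malformed markup. The sketch below assumes the fragment is wrapped in a single div, since a fragment with several top-level elements may not round-trip cleanly with these flags; the sample markup is made up for illustration.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The first <p> is deliberately left unclosed to show that repair still happens.
$dom->loadHTML(
    '<div><p>Unclosed paragraph<p>Second</p></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

// No doctype and no <html>/<body> wrapper are added, but the parser
// still closes the unclosed <p> for us.
$out = $dom->saveHTML();
echo $out;
```

So the wrappers can be avoided, while the repair of invalid markup itself remains, consistent with the note above.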