Find and replace keywords with links using DOMDocument

Date: 2016-01-16 20:24:13

Tags: php domdocument

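I have been working on a way to find certain keywords when they occur inside a 'p', 'span' or 'blockquote' element and to replace them with links using DOMDocument. I have already written a regular expression that does this, but I would prefer to use DOMDocument, since it should lead to a more robust solution.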

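The code below has two problems. The main one: if I put an & in $html, it breaks, because the & is not escaped, and I can't find the correct way to escape it.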

A smaller problem, not as important: if the HTML is invalid, DOMDocument tries to correct it, and I can't seem to prevent that.

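preg_replace is called with arrays because the keywords will eventually be loaded dynamically, and there will be more than one of them.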

$html = '
<blockquote>Random we random text</blockquote>
<p>We like to match text</p>
<p>This is sample text</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->strictErrorChecking = false;

$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));

$xpath = new DOMXPath($dom);

foreach($xpath->query('//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]') as $node)
{
    $replaced = preg_replace(
        array('/(^|\s)'.preg_quote('we', '/').'(\s|$)/msi'), 
        array('<a href="#wrapped">we</a>'),
        $node->wholeText
    );
    $newNode  = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}

$result = mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");

libxml_clear_errors();

echo $result;

1 answer:

Answer 0 (score: 0)

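The problem with the ampersand comes from the fact that you inject HTML with appendXML($replaced) without escaping the <, > and & characters in the text parts.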
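To make the difference concrete, here is a minimal sketch that feeds the same string to appendXML() and to createTextNode(). The sample text and the variable names are made up for illustration; only the DOM calls come from PHP itself.

```php
<?php
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$p   = $dom->appendChild($dom->createElement('p'));

// Hypothetical sample text containing characters that are special in XML.
$text = 'Tom & Jerry <3';

// 1) appendXML() parses its argument as XML markup. A bare "&" (or "<")
//    makes the fragment malformed, so the call fails and returns false.
$fragment   = $dom->createDocumentFragment();
$fragmentOk = $fragment->appendXML($text);

// 2) createTextNode() treats the argument as plain text. The library
//    escapes it when the node is serialized.
$p->appendChild($dom->createTextNode($text));
$serialized = $dom->saveXML($p);

libxml_clear_errors();

var_dump($fragmentOk);   // bool(false)
echo $serialized, "\n";  // <p>Tom &amp; Jerry &lt;3</p>
```

The text node survives untouched in the DOM and is only escaped on output, which is why building nodes with DOM methods sidesteps the escaping question.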

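The underlying issue is that even though you use DOMDocument to avoid manipulating HTML with regular expressions, you are still manipulating HTML as a string, just on a smaller scale, and so you run into the same kind of problems.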

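Here is a way to avoid all of that. I did not keep the array style of replacement, so as not to overcomplicate things. I'm sure you will manage to extend it to other kinds of replacements when needed: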

foreach ($xpath->query(
        '//text()[not(ancestor::a)][(ancestor::p|ancestor::blockquote)]')
        as $node) {
    // Keep a reference to the parent node:
    $parent = $node->parentNode;
    // Split text (e.g. "random we random text") into parts so we 
    // can isolate the parts that must be modified.
    // e.g. into: ["random ", "we", " random text"] 
    $parts = preg_split('/\b('.preg_quote('we', '/').')\b/msi',
                          $node->textContent, 0, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $index => $part) {
        // Skip empty parts only; empty() would also drop a part that is "0":
        if ($part === '') continue;
        // Parts corresponding with the captured expression in the 
        // split delimiter (e.g. "we") occur at odd indexes:
        if ($index % 2) {
            // Create the anchor the DOM-way. The value that is passed
            // should not be interpreted as HTML, so we escape it:
            $el = $dom->createElement('a', htmlentities($part));
            $el->setAttribute('href', '#wrapped');
        } else {
            // Create the text node the DOM-way. The text will be escaped
            // by the library, as it knows it should not be interpreted 
            // as HTML:
            $el = $dom->createTextNode($part);
        }
        // insert this part, before the node we are processing
        $parent->insertBefore($el, $node);
    }
    // when all parts are inserted, delete the node we split
    $parent->removeChild($node);
}

This way you will not run into the ampersand problem.
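For the dynamic, multi-keyword case mentioned in the question, the same splitting idea can be extended with a single alternation pattern. The following is only a sketch under some assumptions: the function name linkKeywords and the keyword-to-href map are hypothetical, the input is assumed to be ASCII (for UTF-8 input, keep the encoding conversion from the question), and the arrow function requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper (not from the original post): wraps whole-word keyword
// matches inside <p> and <blockquote> text in links, using DOM methods only.
function linkKeywords(string $html, array $links): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html); // assumption: ASCII input, no encoding handling here
    $xpath = new DOMXPath($dom);

    // One alternation covering all keywords, each quoted for use in a regex.
    $quoted  = array_map(fn($k) => preg_quote((string) $k, '/'), array_keys($links));
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    // Matching is case-insensitive, so look hrefs up by lower-cased keyword.
    $hrefs = array_change_key_case($links, CASE_LOWER);

    $query = '//text()[not(ancestor::a)][ancestor::p or ancestor::blockquote]';
    foreach ($xpath->query($query) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no keyword in this text node
        }
        $parent = $node->parentNode;
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2) {
                // Odd indexes hold the captured keyword: build the link
                // from an element plus a text node, so nothing is parsed.
                $el = $dom->createElement('a');
                $el->appendChild($dom->createTextNode($part));
                $el->setAttribute('href', $hrefs[strtolower($part)]);
            } else {
                $el = $dom->createTextNode($part);
            }
            $parent->insertBefore($el, $node);
        }
        $parent->removeChild($node);
    }
    libxml_clear_errors();

    // Serialize only the children of <body>, without the wrapper itself.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = '<blockquote>Random we random text</blockquote>'
      . '<p>We like to match text & more</p>';
echo linkKeywords($html, ['we' => '#wrapped', 'text' => '#text']);
```

Here the link text goes through createTextNode() as well, so no manual escaping call is needed, and the unescaped & in the sample input ends up as &amp; in the output.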

Note: I am not aware of any way to stop DOMDocument from "fixing" invalid HTML.