Question

I have a function that uses PHP's DOMDocument to replace the href attribute of the anchors in a string. Here is a snippet:

$doc        = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) wraps $text in doctype, html, body, etc. tags. I tried to get around this by doing the following instead of calling loadHTML():

$doc        = new DOMDocument('1.0', 'UTF-8');
$node       = $doc->createTextNode($text);
$doc->appendChild($node);
...

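Unfortunately, this encodes all the entities (including the anchors). Does anyone know how to turn this off? I have gone through the docs thoroughly and tried hacking at it, but couldn't figure it out.

Thanks! :)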

Answer 1

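$text is a translated string with placeholder anchor tags

If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick. I don't recommend regular expressions for processing HTML documents in general, but for a small, well-defined subset they are suitable.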
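As a rough sketch of that idea: the helper below is hypothetical (the name replacePlaceholderHrefs is not from the question), and it assumes every placeholder looks exactly like <a href="..."> with double quotes and no other attributes. Anything looser than that would need a different pattern or a real parser.

```php
<?php
// Hypothetical helper, assuming placeholders of the exact form <a href="...">.
function replacePlaceholderHrefs(string $text, string $href): string
{
    // Escape the new URL so it is safe inside a double-quoted attribute.
    $safeHref = htmlspecialchars($href, ENT_QUOTES, 'UTF-8');

    return preg_replace_callback(
        '/<a href="[^"]*">/',
        function (array $match) use ($safeHref) {
            // Rebuild the opening tag with the new href; the rest of $text is untouched.
            return '<a href="' . $safeHref . '">';
        },
        $text
    );
}

echo replacePlaceholderHrefs(
    'You can <a href="#">click here</a> to visit Google.',
    'http://google.com'
);
```

Because the rest of the string is never parsed or re-serialized, existing entities and non-ASCII characters in the translation pass through unchanged.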

Answer 2

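XML has only very few predefined entities. All the HTML entities are defined elsewhere. When you use loadHTML(), those entity definitions are loaded automatically; with loadXML() (or with no load() at all) they are not.
createTextNode() does exactly what the name suggests. Everything you pass as the value is treated as text content, not as markup. That is, if you pass something that has a special meaning in markup (<, >, ...), it is encoded in a way that lets a parser distinguish the text from actual markup (&lt;, &gt;, ...).

Where does $text come from? Can't you do the replacement within the actual HTML document?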
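A small demonstration of that behaviour, as a sketch only. The wrapping p element and the sample string are assumptions made for illustration, and passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$p   = $doc->createElement('p');

// createTextNode() treats its argument as character data, never as markup.
$p->appendChild($doc->createTextNode('You can <a href="#">click here</a>.'));
$doc->appendChild($p);

// On serialization the angle brackets are escaped, so no <a> element appears.
$html = $doc->saveHTML($p);
echo $html, "\n";

// The document contains no anchor elements at all.
echo $doc->getElementsByTagName('a')->length, "\n"; // prints 0
```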

Answer 3

I ended up hacking around this in a tenuous way, changing:

return $doc->saveHTML();

to:

$text       = $doc->saveHTML();
return mb_substr($text, 122, -19);

This strips out all the unnecessary garbage, turning this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>
You can <a href="http://www.google.com">click here</a> to visit Google.</p>
</body></html>

into this:

You can <a href="http://www.google.com">click here</a> to visit Google.

Can anyone come up with something better?
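One possibly less brittle option than fixed substring offsets, offered only as a sketch: wrap the fragment in a known container and serialize that container's children. The function name replaceAnchorHrefs and the wrapping div are assumptions for illustration, and passing a node to saveHTML() assumes PHP 5.3.6 or later. Note that loadHTML() may misread non-ASCII input unless the encoding is declared, so this has not been checked against translated strings.

```php
<?php
// Hypothetical helper: set every anchor's href and return only the fragment.
function replaceAnchorHrefs(string $text, string $href): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // The wrapping <div> gives a known container to serialize from,
    // so the doctype, <html> and <body> added by loadHTML() can be ignored.
    $doc->loadHTML('<div>' . $text . '</div>');

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $href);
    }

    // The first <div> in document order is the wrapper added above.
    $container = $doc->getElementsByTagName('div')->item(0);

    $out = '';
    foreach ($container->childNodes as $child) {
        // Serialize each child node on its own, without the wrapper.
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo replaceAnchorHrefs(
    'You can <a href="#">click here</a> to visit Google.',
    'http://google.com'
);
```

Unlike the mb_substr() offsets, this does not depend on the exact length of the doctype or closing tags that saveHTML() emits.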

Answer 4

OK, here's my final, final solution. I decided to go with VolkerK's suggestion.

public static function ReplaceAnchors($text, array $attributeSets)
{
    $expression = '/(<a)([\s\w\d:\/=_&\[\]\+%".?])*(>)/';

    if (empty($attributeSets) || !is_array($attributeSets)) {
        // no attributes to set. Set href="#".
        return preg_replace($expression, '$1 href="#"$3', $text);
    }

    $attributeStrs  = array();
    foreach ($attributeSets as $attributeKeyVal) {
        // loop thru attributes and set the anchor
        $attributePairs = array();
        foreach ($attributeKeyVal as $name => $value) {
            if (!is_string($value) && !is_int($value)) {
                continue; // skip
            }

            $name               = htmlspecialchars($name);
            $value              = htmlspecialchars($value);
            $attributePairs[]   = "$name=\"$value\"";
        }
        $attributeStrs[]    = implode(' ', $attributePairs);
    }

    $i      = -1;
    $pieces = preg_split($expression, $text);
    foreach ($pieces as &$piece) {
        if ($i === -1) {
            // skip the first token
            ++$i;
            continue;
        }

        // figure out which attribute string to use
        if (isset($attributeStrs[$i])) {
            // pick the parallel attribute string
            $attributeStr   = $attributeStrs[$i];
        } else {
            // pick the last attribute string if we don't have enough
            $attributeStr   = $attributeStrs[count($attributeStrs) - 1];
        }

        // build a new opening anchor for this token.
        $piece  = '<a '.$attributeStr.'>'.preg_replace($expression, '$1 href="#"$3', $piece);
        ++$i;
    }

    return implode('', $pieces);
}

This allows the function to be called with a set of different anchor attributes.

How do I prevent PHP's DOMDocument from encoding HTML entities?
