Can meta tag content values from an XPath query be trusted?

Asked: 2015-05-21 12:32:09

Tags: php xpath

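I have a PHP function that extracts meta tags from a URL using an XPath query.

For example: $xpath->query('/html/head/meta[@name="my_target"]/@content')

My questions:

Can I trust the returned value, or should I validate it?

=> Is there a possible XSS vulnerability?

=> Should the HTML content be purified before it is loaded into DOMDocument?

Another way to put it, with some code: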

    $doc = new DOMDocument;
    $doc->preserveWhiteSpace = false;
    libxml_use_internal_errors(true);

    // is
    $doc->loadHTMLFile($url);
    // trustable ??

    // or is
    $html = file_get_contents($url);
    $trust = $purifier->purify($html);
    $doc->loadHTML($trust);
    // a better practice ??

    libxml_use_internal_errors(false);
    $xpath = new DOMXPath($doc);

    $trustable = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0); // ?

===== UPDATE =========================================

Yes, never trust external sources.

Use $be_sure = htmlspecialchars($trustable->textContent) or strip_tags($trustable->textContent).

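1 answer:

Answer 0: (score: 0)

If you are pulling HTML content from a source you do not control, then yes, I would treat this code as potentially dangerous!

You can use htmlspecialchars() to convert any special characters into HTML entities. Or, if you want to keep some of the markup, you can use strip_tags(). Another option is filter_var(), which gives you finer control over the filtering.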
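To make the difference between these three functions concrete, here is a minimal sketch run against a made-up hostile string. The `$untrusted` value is an assumption for illustration, not something from the question. Note that strip_tags() keeps the text inside removed tags and leaves the attributes of allowed tags untouched, so on its own it is not a complete XSS defence.

```php
<?php
// Hypothetical untrusted value, as it might come back from a remote page.
$untrusted = 'Hello <b onclick="evil()">world</b><script>alert(1)</script>';

// 1. Escape everything: no markup survives, the text is safe to echo into HTML.
$escaped = htmlspecialchars($untrusted, ENT_QUOTES, 'UTF-8');

// 2. Keep <b>, drop every other tag. The onclick attribute and the
//    script's text content are both still present afterwards.
$stripped = strip_tags($untrusted, '<b>');

// 3. filter_var() with a sanitizing filter that encodes HTML special characters.
$filtered = filter_var($untrusted, FILTER_SANITIZE_FULL_SPECIAL_CHARS);

echo $escaped, PHP_EOL;
echo $stripped, PHP_EOL; // Hello <b onclick="evil()">world</b>alert(1)
echo $filtered, PHP_EOL;
```

For a value that will only ever be displayed as text, the first form is usually the simplest choice.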

Or you could use a library such as HTML Purifier, but that may be overkill for you. It all depends on the type of content you are working with.
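For the meta tag case in the question, the extracted value is a plain attribute string, so escaping it at output time is likely enough. The sketch below assumes an inline sample page in place of the remote URL (the `$html` content is hypothetical). It also shows why the value cannot be trusted as-is: the parser decodes entities, so an entity-encoded attribute comes back as live markup.

```php
<?php
// Hypothetical remote page; in practice this would be fetched from the URL.
$html = <<<'HTML'
<html><head>
<meta name="my_target" content="&quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;">
</head><body></body></html>
HTML;

// Silence libxml warnings while parsing, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$node  = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0);

// The DOM hands back the decoded attribute value: "><script>alert(1)</script>
$raw = $node !== null ? $node->nodeValue : '';

// Escape at the point of output, for the HTML context it is written into.
$safe = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

echo '<p>' . $safe . '</p>', PHP_EOL;
```

Keeping `$raw` unmodified and escaping only when printing means the same value can later be escaped differently for another context, such as a URL or a JSON payload.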

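Now, to sanitize an element, you first need a string representation of the XPath result. Apply your filtering to that string, then put it back into the document. The following example should do what you want: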

<?php
// The following HTML is what you fetch from your remote source:
$html = <<<EOL
<html>
 <body>
    <h1>Foo, bar!</h1>
    <div id="my-target">
        Here is some <strong>text</strong> <script>javascript:alert('some malicious script!');</script> that we want to sanitize.
    </div>
 </body>
</html>
EOL;

// We instantiate a DOMDocument so we can work with it:
$original = new DOMDocument("1.0", 'UTF-8');
$original->formatOutput = true;
$original->loadHTML($html);

$body = $original->getElementsByTagName('body')->item(0);

// Find the element we need using Xpath:
$xpath = new DOMXPath($original);
$divs  = $xpath->query("//body/div[@id='my-target']");

// The XPath query will return DOMElement objects, so create a string that we can manipulate out of it:
$innerHTML  = '';
if ($divs->length > 0) // DOMNodeList is only Countable from PHP 7.2, so check ->length
{
    $div = $divs->item(0);

    // Now get the innerHTML for this element
    foreach ($div->childNodes as $child) {
        $innerHTML .= $original->saveXML($child);
    }

    // Remove it from the original document because we want to replace it anyway
    $div->parentNode->removeChild($div);
}

// Sanitize our string by removing all tags except <strong> and the container <div>
$innerHTML = strip_tags($innerHTML, '<strong>');
// or htmlspecialchars() or filter_var or HTML Purifier ..

// Now re-import the sanitized string into a blank DOMDocument
$sanitized = new DOMDocument("1.0", 'UTF-8');
$sanitized->formatOutput = true;
$sanitized->loadXML('<div id="my-target">' . $innerHTML . '</div>');

// Now add the sanitized DOMElement back into the original document as a child of <body>
$body->appendChild($original->importNode($sanitized->documentElement, true));

echo $original->saveHTML();
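As an alternative to the string round trip above, unwanted elements can also be removed directly in the DOM. This is a different technique from the strip_tags() approach, shown here only as a minimal sketch: it assumes that dropping `<script>` elements is all that is needed, and it does not touch event-handler attributes or other risky markup. Unlike strip_tags(), it removes the script's text content as well.

```php
<?php
// Hypothetical fetched HTML, shortened from the example above.
$html = '<html><body><div id="my-target">Here is some <strong>text</strong> '
      . '<script>alert(1);</script> to keep.</div></body></html>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Collect the matching nodes first, then detach each one from its parent.
$unwanted = iterator_to_array($xpath->query("//div[@id='my-target']//script"));
foreach ($unwanted as $node) {
    $node->parentNode->removeChild($node);
}

// Serialize only the cleaned container element.
$div   = $xpath->query("//div[@id='my-target']")->item(0);
$clean = $doc->saveHTML($div);

echo $clean, PHP_EOL;
```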

Hope that helps.