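I have a PHP function that uses an XPath query to extract meta tags from a URL.
For example: $xpath->query('/html/head/meta[@name="my_target"]/@content')
My question:
Can I trust the returned value, or should I validate it?
=> Is there a possible XSS vulnerability?
=> Should I sanitize the HTML content before loading it into a DOMDocument?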
// Another way to put it, with some code:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
// Is this
$doc->loadHTMLFile($url);
// trustworthy?
// Or is this
$html = file_get_contents($url);
$trust = $purifier->purify($html);
$doc->loadHTML($trust);
// a better practice?
libxml_use_internal_errors(false);
$xpath = new DOMXPath($doc);
$trustable = $xpath->query('/html/head/meta[@name="my_target"]/@content')->item(0); // ?
===== UPDATE =========================================
Yes, never trust external resources.
Use $be_sure = htmlspecialchars($trustable->textContent)
or strip_tags($trustable->textContent)
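For what it's worth, here is a minimal sketch of that update as a complete function. The name extractMyTarget() and the sample HTML are made up for illustration; the XPath query is the one from the question. One thing it shows: the DOM parser decodes entities in the attribute value, so the value can hold a raw <script> tag even when the remote page escaped it, which is why the escaping has to happen on the extracted string.

```php
<?php
// Hypothetical helper (not from the original post): returns the escaped
// content of <meta name="my_target">, or null when the tag is missing.
function extractMyTarget(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('/html/head/meta[@name="my_target"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // The attribute value is already entity-decoded here, so escape it again
    return htmlspecialchars($nodes->item(0)->nodeValue, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// Sample input: the remote page escaped the payload inside the attribute
$html = '<html><head><meta name="my_target" content="&lt;script&gt;alert(1)&lt;/script&gt;"></head><body></body></html>';
$safe = extractMyTarget($html);
echo $safe, PHP_EOL;
```

If the value is going somewhere other than HTML text (a URL, a JavaScript string, an SQL query), a different escaping function would be needed for that context.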
Answer 0 (score: 0)
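If you are pulling HTML content from a source you don't control, then yes, I would treat anything extracted by this code as potentially dangerous!
You can use htmlspecialchars() to convert any special characters to HTML entities. Or, if you want to keep part of the markup, you can use strip_tags(). Another option is filter_var(), which gives you more control over the filtering.
Or you could use a library such as HTML Purifier, but that may be overkill for you. It all depends on the type of content you are working with.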
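To make the difference between these options concrete, here is a small sketch that runs the same string through each built-in function. The sample input and variable names are made up for illustration.

```php
<?php
// Hypothetical untrusted input
$untrusted = '<strong>Hi</strong> <script>alert("x")</script>';

// Escape everything: all markup is turned into visible text
$escaped = htmlspecialchars($untrusted, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Remove every tag except <strong>; the text inside <script> is kept
$stripped = strip_tags($untrusted, '<strong>');

// filter_var() with a sanitize filter that also encodes special characters
$filtered = filter_var($untrusted, FILTER_SANITIZE_FULL_SPECIAL_CHARS);

echo $escaped, PHP_EOL, $stripped, PHP_EOL, $filtered, PHP_EOL;
```

Keep in mind that strip_tags() does not touch attributes on the tags you allow, so something like <strong onclick="..."> would pass through unchanged. If you need to keep markup from an untrusted source, that is the case where HTML Purifier is probably worth the extra weight.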
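Now, to sanitize an element, you first need to get a string representation of the XPath result. Apply your filtering, then put the result back in. The following example should do what you want: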
<?php
// The following HTML is what you fetch from your remote source:
$html = <<<EOL
<html>
<body>
<h1>Foo, bar!</h1>
<div id="my-target">
Here is some <strong>text</strong> <script>javascript:alert('some malicious script!');</script> that we want to sanitize.
</div>
</body>
</html>
EOL;
// We instantiate a DOMDocument so we can work with it:
$original = new DOMDocument("1.0", 'UTF-8');
$original->formatOutput = true;
$original->loadHTML($html);
$body = $original->getElementsByTagName('body')->item(0);
// Find the element we need using Xpath:
$xpath = new DOMXPath($original);
$divs = $xpath->query("//body/div[@id='my-target']");
// The XPath query returns a DOMNodeList of DOMElement objects, so build a string we can manipulate from the first match:
$innerHTML = '';
if ($divs !== false && $divs->length > 0)
{
$div = $divs->item(0);
// Now get the innerHTML for this element
foreach ($div->childNodes as $child) {
$innerHTML .= $original->saveXML($child);
}
// Remove it from the original document because we want to replace it anyway
$div->parentNode->removeChild($div);
}
// Sanitize our string by removing all tags except <strong> (the container <div> is re-added below).
// Note: strip_tags() removes the <script> tags but keeps the text between them.
$innerHTML = strip_tags($innerHTML, '<strong>');
// or htmlspecialchars(), filter_var(), or HTML Purifier...
// Now re-import the sanitized string into a blank DOMDocument
$sanitized = new DOMDocument("1.0", 'UTF-8');
$sanitized->formatOutput = true;
$sanitized->loadXML('<div id="my-target">' . $innerHTML . '</div>');
// Now add the sanitized DOMElement back into the original document as a child of <body>
$body->appendChild($original->importNode($sanitized->documentElement, true));
echo $original->saveHTML();
Hope that helps.