Question

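I need to save some data together with some HTML tags, so I can't use strip_tags on the whole text, and I can't use htmlentities because the text must be formatted by the tags. To protect other users against XSS, I have to remove any JavaScript from inside the tags.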

What is the best way to do this?

Answer 1

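If you need to save HTML tags in your database and later want to print them back to the browser, there is no 100% secure way to achieve this using built-in PHP functions. It is easy when there are no HTML tags: when you only have text, you can use the built-in PHP functions to clean it.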
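For the plain-text case, a minimal sketch, assuming the value should be displayed as literal text with no markup allowed at all. The helper name escapeText is hypothetical; htmlspecialchars is the built-in doing the work.

```php
<?php
// Hypothetical helper: escape a plain-text value before printing it into HTML.
function escapeText(string $text): string
{
    // ENT_QUOTES escapes both double and single quotes, so the result can
    // also be placed inside a quoted attribute value.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$escaped = escapeText("<script>alert('xss')</script>");
echo $escaped, "\n";
// prints: &lt;script&gt;alert(&#039;xss&#039;)&lt;/script&gt;
```

This only covers text that must not contain markup; it does not help when some tags have to survive, which is the harder case the question asks about.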

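There are some functions that clear XSS from text, but they are not 100% secure, and there is always a way for XSS to slip through unnoticed. Your regex example is fine, but what if I use, let's say, < script>alert('xss')</script>, or some other combination that the regex might miss and the browser would still execute?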
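To make the regex problem concrete, here is a small sketch. The filter naiveStripScripts and its pattern are hypothetical, not the asker's regex; they stand in for a typical single-pass "remove script tags" approach and show two inputs that get past it.

```php
<?php
// Hypothetical naive filter: remove <script>...</script> blocks in one regex pass.
function naiveStripScripts(string $html): string
{
    return preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
}

// 1. An event-handler attribute contains no <script> tag, so it passes through.
$viaAttribute = naiveStripScripts('<img src="x" onerror="alert(1)">');

// 2. Nested fragments reassemble into a working tag once the inner match is removed.
$viaNesting = naiveStripScripts('<scr<script></script>ipt>alert(1)</scr<script></script>ipt>');

echo $viaAttribute, "\n"; // unchanged: <img src="x" onerror="alert(1)">
echo $viaNesting, "\n";   // <script>alert(1)</script>
```

Patching the pattern for these two cases just moves the problem to the next variant, which is why the answers here recommend parsing the HTML instead.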

The best way to do this is to use something like HTML Purifier.

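Also note that there are two levels of security: the first is when data goes into the database, and the second is when it comes out of the database.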

Hope this helps!

Answer 2

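I suggest you use DOMDocument (with loadHTML) to load the HTML, remove every kind of tag and every attribute you don't want to see, and save it back to HTML (using saveXML or saveHTML). You can do that by recursively iterating over the children of the document's root and replacing the tags you don't want with their inner contents. Since loadHTML parses markup in a way similar to how browsers do, this is much safer than using regular expressions.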

Edit: Here is a "purifying" function I made:

<?php

function purifyNode($node, $whitelist)
{
    // copy childNodes since we're going to iterate over it and modify the collection
    $children = array();
    foreach ($node->childNodes as $child)
        $children[] = $child;

    foreach ($children as $child)
    {
        if ($child->nodeType == XML_ELEMENT_NODE)
        {
            purifyNode($child, $whitelist);
            if (!isset($whitelist[strtolower($child->nodeName)]))
            {
                // tag is not whitelisted: move its children up into the parent, then drop it
                while ($child->childNodes->length > 0)
                    $node->insertBefore($child->firstChild, $child);
                $node->removeChild($child);
            }
            else
            {
                $attributes = $whitelist[strtolower($child->nodeName)];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute)
                    $childAttributes[] = $attribute;

                foreach ($childAttributes as $attribute)
                {
                    // drop attributes that are not listed or whose value fails the regex
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value))
                        $child->removeAttribute($attribute->name);
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes())
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));

    purifyNode($doc->documentElement, $whitelist);
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}

?>

You would call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that are not in the whitelist are stripped, with their contents inlined in the parent tag. Attributes that are not listed for a given tag in the whitelist are removed as well, and so are attributes that are listed but whose value does not match the regex.

Here is an example:

<?php

$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;

// whitelist format: tag => array(attribute => regex)
$whitelist = array(
    'b' => array(),
    'i' => array(),
    'u' => array(),
    'p' => array(),
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'),
    'a' => array('href' => '@\Ahttp://.+\Z@')
);

$purified = purifyHTML($html, $whitelist);
echo $purified;

?>

The result is:

<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>

Obviously, you don't want to allow any on* attribute, and I would advise against style because of weird proprietary properties like behavior. Make sure you validate all URL attributes with a decent regex that matches the full string (\Aregex\Z).
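The point about matching the full string can be illustrated with a short sketch. The helper urlPasses and the sample values are hypothetical; the anchored pattern is the one used in the whitelist above.

```php
<?php
// Hypothetical check: does a URL attribute value pass the given pattern?
function urlPasses(string $pattern, string $value): bool
{
    return preg_match($pattern, $value) === 1;
}

$unanchored = '@http://.+@';     // may match anywhere inside the value
$anchored   = '@\Ahttp://.+\Z@'; // the value itself must start with http://

$evil = "javascript:evil('http://example.org/')";
$good = 'http://example.org/image.png';

var_dump(urlPasses($unanchored, $evil)); // bool(true)
var_dump(urlPasses($anchored, $evil));   // bool(false)
var_dump(urlPasses($anchored, $good));   // bool(true)
```

Without the anchors, a javascript: URL that merely contains "http://" somewhere in its body is accepted.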

Answer 3

If you want to allow specific tags, you will have to parse the HTML.

There is already a nice library for that purpose: HTML Purifier (open source under the LGPL).

Answer 4

I wrote this code for that purpose; you can set the list of tags and attributes to remove.

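// Remove the given attribute name(s) from every element in the document.
function RemoveTagAttribute($Dom, $Name) {
    $finder = new DOMXPath($Dom);
    if (!is_array($Name)) $Name = array($Name);
    foreach ($Name as $Attribute) {
        $Attribute = strtolower($Attribute);
        // re-query until no element carries the attribute any more
        do {
            $tag = $finder->query("//*[@" . $Attribute . "]");
            foreach ($tag as $T) {
                if ($T->hasAttribute($Attribute)) {
                    $T->removeAttribute($Attribute);
                }
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}

// Remove every element with the given tag name(s), including its contents.
function RemoveTag($Dom, $Name) {
    if (!is_array($Name)) $Name = array($Name);
    foreach ($Name as $tagName) {
        $tagName = strtolower($tagName);
        // getElementsByTagName returns a live list, so removing nodes while
        // iterating can skip some; repeat until the list is empty
        do {
            $tag = $Dom->getElementsByTagName($tagName);
            foreach ($tag as $T) {
                $T->parentNode->removeChild($T);
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}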

Example:

  $dom= new DOMDocument; 
   $HTML = str_replace("&", "&amp;", $HTML);  // disguise &s going IN to loadXML() 
  // $dom->substituteEntities = true;  // collapse &s going OUT to transformToXML() 
   $dom->recover = TRUE;
   @$dom->loadHTML('<?xml encoding="UTF-8">' .$HTML); 
   // dirty fix
   foreach ($dom->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
      $dom->removeChild($item); // remove hack
   $dom->encoding = 'UTF-8'; // insert proper
  $dom=RemoveTag($dom,"script");
  $dom=RemoveTagAttribute($dom,array("onmousedown","onclick"));
  echo $dom->saveHTML();
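The example above removes only the two attributes it lists. As a hedged sketch, every on* attribute could be removed in one pass with an XPath query over attribute nodes. The function name removeEventHandlerAttributes is hypothetical and not part of the answer's code, and it does nothing about javascript: URLs or style attributes.

```php
<?php
// Hypothetical helper: strip every attribute whose name starts with "on".
// loadHTML lowercases attribute names, so a case-sensitive test is enough here.
function removeEventHandlerAttributes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // copy the matches into an array before modifying the document
    $attrs = iterator_to_array($xpath->query('//@*[starts-with(name(), "on")]'));
    foreach ($attrs as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p onclick="a()" onmouseover="b()" title="ok">text</p>');
removeEventHandlerAttributes($dom);
$out = $dom->saveHTML();
echo $out;
```

A blacklist like this still depends on knowing every dangerous attribute, so a whitelist of allowed tags and attributes remains the more defensive design.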

How can I be 100% sure of removing JS inside HTML tags?
