Question

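So I've set up a page where people can submit tutorials. These tutorials are basically built with the TinyMCE editor.

Anyway, someone could abuse this and simply post their own unescaped text and insert some malicious <script>.

So my question is: is removing <script> tags with a regular expression safe enough? I would run this regex on my backend before storing the content.

I've found this expression, for example: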

<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>

Answer 1

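No. They can use multi-byte characters to get around your regex, or sneakily combine mismatched opening and closing tags, create fake closing script tags, quote them inside attributes, and so on. Don't try to parse potentially noisy or malformed HTML with a regex; use an HTML parsing engine designed for this kind of problem. See the following famous answer about parsing HTML with regex: RegEx match open tags except XHTML self-contained tags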
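To make this kind of bypass concrete, here is a small JavaScript sketch that runs the regex from the question against two payloads it does not match. The helper name `stripScriptTags` and the payload strings are illustrative assumptions, not taken from the question; they are variations in the same spirit as the cases listed above, not an exhaustive list.

```javascript
// The regex from the question, applied with the "gi" flags.
const scriptTagRegex = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// Hypothetical helper: a single global replace pass.
function stripScriptTags(html) {
  return html.replace(scriptTagRegex, '');
}

const payloads = [
  // Browsers accept whitespace before ">" in an end tag; the regex does not.
  "<script>alert(1)</script >",
  // No script tag at all: the code lives in an event-handler attribute.
  "<img src=x onerror=alert(1)>",
];

for (const payload of payloads) {
  const cleaned = stripScriptTags(payload);
  console.log(cleaned === payload ? 'unchanged:' : 'changed:', cleaned);
}
```

Both payloads come back unchanged, so a filter built on this regex would store them as submitted.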

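If you are looking for one, I swear by this PHP library: http://simplehtmldom.sourceforge.net/
It first cleans the document by converting noise into entities, and then treats anything found between the opening and closing tags of "script", "style" and "textarea" elements as text rather than HTML. It then parses the result into a DOM structure that you can traverse much as you would with the DOM methods in JavaScript. It comes with a "save" method (which produces a string), so once you have finished stripping tags from the page you are left with a modified, well-formed document. I have tested the library with large inputs: documents that hit the PHP memory limit when I processed them with regular expressions were parsed by this library without memory problems. So I have tested it thoroughly and used it in large projects, and it has never let me down the way PHP's built-in functions and classes do with malformed data.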

Edit: Since I received a downvote, I suppose I should give an example that gets around the regex:

<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>

Just because jQuery uses a regex doesn't make it safe on the server.

Even if you use the "gi" flags, it doesn't matter:

var str = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
str = str.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, '');
// the "g" flag doesn't help here since you need to start from the beginning, not continue in the middle
alert(str);

But if you apply it in a loop instead of relying on the "g" flag, you will get rid of the case I presented.
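A minimal JavaScript sketch of that loop, assuming the regex from the question; the function names `stripOnce` and `stripUntilStable` are hypothetical. It repeats the replacement until the string stops changing, so a tag that is reassembled by an earlier removal is caught on the next pass.

```javascript
// The regex from the question.
const scriptRe = /<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi;

// One global pass: removes every match found in a single left-to-right scan.
function stripOnce(html) {
  return html.replace(scriptRe, '');
}

// Repeat the pass until a fixed point is reached.
function stripUntilStable(html) {
  let previous;
  do {
    previous = html;
    html = stripOnce(html);
  } while (html !== previous);
  return html;
}

const nested = "<scr<script>ipt></scr</script>ipt>alert('XSS!')</script>";
console.log(stripOnce(nested));        // <script>alert('XSS!')</script>
console.log(stripUntilStable(nested)); // empty string
```

The loop only closes this particular gap. Input that the regex never matches in the first place still passes through untouched, which is why the answer recommends a real parser.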

Edit 2: If the goal is to clean user input of every JavaScript problem, such as "onload" and "onclick" attributes, why reinvent the wheel? There is http://htmlpurifier.org/ (see the demo).

Answer 2

Instead of a regular expression, why not use the DOM?

$content = "<h1>title</h1><p> test <span>1<!-- regular comment --><script> my script</script></span><script> my script</script></p><script> my script</script> <!--[if IE]><script>alert('XSS');</script><![endif]-->";

// creates a DOMDocument from your string (without doctype, html and other extra tags), wrapped in a div
$dom = new DOMDocument();
$dom->loadHTML("<div>{$content}</div>", LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

//Removing any comments or conditional comments
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

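// recursively removes every element whose tag name is listed in $removedTags
function verifyNodes(DOMNode $node) {
    $removedTags = ['script', 'iframe']; // tags to remove

    // childNodes is a live list: iterate over a snapshot so that removing
    // a node does not cause its next sibling to be skipped
    foreach (iterator_to_array($node->childNodes) as $childNode)
    {
        if (in_array($childNode->nodeName, $removedTags)) {
            $childNode->parentNode->removeChild($childNode);
        } elseif ($childNode->hasChildNodes()) {
            verifyNodes($childNode);
        }
    }
}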

// calling verifyNodes
verifyNodes($dom);

// get all the content of my first div, and print it
$newContent = $dom->getElementsByTagName('div')->item(0);
foreach ($newContent->childNodes as $childNode) {
    var_dump($dom->saveHTML($childNode));
}

Just as I used nodeName to check the tag's name, we could also use nodeType if we wanted to remove other kinds of content (see the list of XML node constants).

Answer 3

If you can use an engine that supports atomic groups, something like this might work. It comes closest to parsing script tags the way a browser does.

Find:
(?><script(?:(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)|/)>)(?<=/>)|(?><script(?:\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+)?>)(?<!/>)[\S\s]*?</script\s*>

Replace: empty string

Formatted:

    # If script tags can be <script .... />
    (?>
         <
         script 
         (?:
              (?:
                   \s+ 
                   (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
              )
           |  / 
         )
         > 
    )
    (?<= /> )
 |  
    # Or, if script tags with content can be <script .... > ... </script>
    (?>
         <
         script 
         (?:
              \s+ 
              (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
         )?
         > 
    )
    (?<! /> )
    [\S\s]*? 
    </script \s* >

Malicious code injection: is removing script tags via regex safe enough?
