Question

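I want to make comments in my blog engine XSS-safe. I have tried a lot of different approaches, but I find it very difficult.

When I display the comments, I first use Microsoft AntiXss 3.0 to HTML-encode the whole thing. Then I try to HTML-decode the safe tags using a whitelist approach.

I have been looking at Steve Downing's example in Atwood's "Sanitize HTML" thread at refactormycode.

My problem is that the AntiXss library encodes values to &#DECIMAL; notation, and I don't know how to rewrite Steve's example, since my regex knowledge is limited.

I tried the following code, where I simply replaced the entities with their decimal form, but it does not work properly.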

&lt; with &#60;
&gt; with &#62;

My rewrite:

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like a HTML tag after HtmlEncoding.  Splits the input so we can get discrete
    /// chunks that start with &lt; and ends with either end of line or &gt;
    /// </summary>
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static Regex _whitelist = new Regex(@"
^&#60;/?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$
|^&#60;(b|h)r\s?/?&#62;$
|^&#60;a(?!&#62;).+?&#62;$
|^&#60;img(?!&#62;).+?/?&#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {

        string tagname = "";
        Match tag;
        MatchCollection tags = _tags.Matches(html);
        string safeHtml = "";

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }

        }

        return html;
    }

}

My input test HTML is:

<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>

After AntiXss it becomes:

&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;

When I run the version of Sanitize(string html) above, it gives me:

<p><script language="javascript">alert&#40;&#39;XSS&#39;&#41;</script><b>bold should work</b></p>

The regex is matching the script tag against the whitelist, which I don't want. Any help with this would be highly appreciated.

Answer 1

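Have you considered using Markdown or VBCode or a similar approach for users to mark up their comments? Then you could disallow all HTML.

If you must allow HTML, then I would consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.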

Answer 2

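Yes, I'm using the WMD editor with Markdown, but I want users to be able to post HTML and code samples like on Stack Overflow, so I don't want to disallow HTML altogether.

I have been looking at HTML Tidy but haven't tried it yet. However, I do use the Html Agility Pack to make sure the HTML is well-formed (no orphan tags). This is done before I run AntiXss.

If I can't get the current solution to work the way I like, I will try HTML Tidy. Thanks for your suggestion.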

Answer 3

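Your problem is that C# misinterprets your regex. With RegexOptions.IgnorePatternWhitespace, an unescaped # starts a comment that runs to the end of the line, so most of each whitelist line is ignored. You need to escape the #-sign. Without the escape it matches too much.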

private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
 );
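To illustrate the point about the #-sign, here is a minimal sketch comparing an unescaped and an escaped pattern under RegexOptions.IgnorePatternWhitespace. The class name HashEscapeDemo and the shortened two-tag pattern are made up for this example; they are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the original sanitizer.
static class HashEscapeDemo
{
    const RegexOptions Options =
        RegexOptions.Singleline |
        RegexOptions.IgnorePatternWhitespace |
        RegexOptions.ExplicitCapture;

    // Unescaped '#': with IgnorePatternWhitespace everything from '#' to the
    // end of the line is a comment, so this pattern is effectively just "^&".
    public static readonly Regex Unescaped =
        new Regex(@"^&#60;(b|i)&#62;$", Options);

    // Escaped '#': the whole pattern is used as written.
    public static readonly Regex Escaped =
        new Regex(@"^&\#60;(b|i)&\#62;$", Options);

    public static void Main()
    {
        string script = "&#60;script&#62;"; // encoded <script>
        string bold = "&#60;b&#62;";        // encoded <b>

        Console.WriteLine(Unescaped.IsMatch(script)); // True  (matches too much)
        Console.WriteLine(Escaped.IsMatch(script));   // False
        Console.WriteLine(Escaped.IsMatch(bold));     // True
    }
}
```

The unescaped pattern collapses to ^&, which would explain why the encoded script tag passed the whitelist in the question.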

Update 2: You might be interested in this xss and regexp site.

Answer 4

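I'm on a Mac, so I can't test your C# code. But to me it seems that you should make the _whitelist regex work on tag names only. That might mean you have to make two passes, one for opening and one for closing tags. But it would make it much simpler.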
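A rough sketch of what the two-pass, tag-name-only idea could look like is shown below. It is an assumption about how the suggestion might be implemented, not tested against real XSS attacks. The class name TagNameWhitelist, the Allowed set, and the two regexes are invented for this example. It deliberately ignores tags with attributes (such as a with href) and self-closing forms such as an encoded br with a slash, so those stay encoded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch: whitelist by tag name only, in two passes.
static class TagNameWhitelist
{
    // Assumed whitelist, taken from the tag names used in the question.
    static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "b", "blockquote", "code", "em", "h1", "h2", "h3", "i", "li",
            "ol", "p", "pre", "s", "sub", "sup", "strong", "strike", "ul",
            "br", "hr"
        };

    // Encoded opening tag without attributes, e.g. &#60;b&#62;
    static readonly Regex Opening = new Regex(
        @"&\#60;(?<name>[a-z][a-z0-9]*)&\#62;", RegexOptions.IgnoreCase);

    // Encoded closing tag, e.g. &#60;&#47;b&#62;
    static readonly Regex Closing = new Regex(
        @"&\#60;&\#47;(?<name>[a-z][a-z0-9]*)&\#62;", RegexOptions.IgnoreCase);

    public static string Sanitize(string encoded)
    {
        // Pass 1: decode whitelisted opening tags, leave the rest encoded.
        string result = Opening.Replace(encoded, m =>
        {
            string name = m.Groups["name"].Value;
            return Allowed.Contains(name)
                ? "<" + name.ToLowerInvariant() + ">"
                : m.Value;
        });

        // Pass 2: decode whitelisted closing tags.
        result = Closing.Replace(result, m =>
        {
            string name = m.Groups["name"].Value;
            return Allowed.Contains(name)
                ? "</" + name.ToLowerInvariant() + ">"
                : m.Value;
        });

        return result;
    }

    public static void Main()
    {
        string input =
            "&#60;p&#62;&#60;script&#62;alert&#40;1&#41;&#60;&#47;script&#62;" +
            "&#60;b&#62;bold&#60;&#47;b&#62;&#60;&#47;p&#62;";
        Console.WriteLine(Sanitize(input));
    }
}
```

Because only the tag name is inspected, the output is rebuilt from the name rather than by decoding the matched text, so nothing else from the match can slip through.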

Answer 5

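In case anyone is interested in using this code, I am posting the complete code here (slightly refactored and with updated comments).

I have also decided to remove the img tag from the whitelist, since @Pez and @some pointed out that it can be dangerous.

It must also be said that I have not done any proper testing against possible XSS attacks. This is only a starting point for seeing how well the method works.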

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HtmlEncoding to &#DECIMAL; notation. Microsoft AntiXSS 3.0 can be used to perform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and ends with either end of line or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);


    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>

    private static readonly Regex _whitelist = new Regex(@"
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
|^&\#60;a(?!&\#62;).+?&\#62;$",


      RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
      RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate through all HTML tags in the input
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through 
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }
        return html;
    }
}

Sanitizing HTML-encoded text (#decimal notation) from AntiXSS v3 output
