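I want to make the comments on my blog engine XSS-safe. I have tried a lot of different approaches, but found it very difficult.
When I display comments, I first HTML-encode the whole thing using Microsoft AntiXss 3.0. Then I try to HTML-decode the safe tags using a whitelist approach.
I have been looking at Steve Downing's example in Atwood's "Sanitize HTML" thread on refactormycode.
My problem is that the AntiXss library encodes values to &#DECIMAL; notation, and I don't know how to rewrite Steve's example, since my regex knowledge is limited.
I tried the following code, in which I simply replaced the named entities with their decimal form, but it does not work properly:
&lt; with &#60;
&gt; with &#62;
My rewrite: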
using System.Text.RegularExpressions;
using System.Web;

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HtmlEncoding. Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either end of line or &#62;
    /// </summary>
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static Regex _whitelist = new Regex(@"
        ^&#60;(&#47;)?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$
        |^&#60;(b|h)r\s?(&#47;)?&#62;$
        |^&#60;a(?!&#62;).+?&#62;$
        |^&#60;img(?!&#62;).+?(&#47;)?&#62;$",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using
    /// a whitelist based approach, leaving the dangerous tags as encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {
        string tagname = "";
        Match tag;
        MatchCollection tags = _tags.Matches(html);
        string safeHtml = "";

        // iterate through all HTML tags in the input, last match first,
        // so that earlier match indexes stay valid while we edit the string
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through
                // HtmlDecode, and re-insert it into the text
                safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }

        return html;
    }
}
My input test HTML is:
<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>
After AntiXss it becomes:
&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;
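When I run the version of Sanitize(string html) above, it gives me:
<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>
The regex is matching the script tag against the whitelist, which is not what I want. Any help with this would be highly appreciated.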
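Answer 0 (score: 1)
Have you considered using Markdown, VBCode, or a similar approach for users to mark up their comments? Then you could disallow all HTML.
If you must allow HTML, then I would consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.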
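Answer 1 (score: 1)
Yes, I am using the WMD editor with Markdown, but I want users to be able to post HTML and code samples like on Stack Overflow, so I don't want to disallow HTML completely.
I have been looking at HTML Tidy but have not tried it yet. I do, however, use the Html Agility Pack to make sure the HTML is well-formed (no orphaned tags). This is done before I run AntiXss.
If I can't get my current solution to work the way I like, I will try HTML Tidy. Thanks for the suggestion.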
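Answer 2 (score: 1)
Your problem is that C# is misinterpreting your regex. Because the pattern is compiled with RegexOptions.IgnorePatternWhitespace, an unescaped # starts a comment that runs to the end of the line, so you need to escape the # sign. Without the escape, the pattern matches too much.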
private static Regex _whitelist = new Regex(@"
    ^&\#60;(&\#47;)?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
    |^&\#60;a(?!&\#62;).+?&\#62;$
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$",
    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
);
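To see the effect in isolation, here is a minimal sketch. The class name PatternCommentDemo and the single-tag patterns are made up for illustration and are not part of the original code. Under RegexOptions.IgnorePatternWhitespace, the unescaped pattern should collapse to ^& and therefore accept any encoded tag, while the escaped pattern only accepts the encoded b tag.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCommentDemo
{
    // Unescaped '#': in IgnorePatternWhitespace mode everything from '#'
    // to the end of the line is a comment, so this is effectively "^&".
    public static readonly Regex Unescaped = new Regex(@"
        ^&#60;b&#62;$",
        RegexOptions.IgnorePatternWhitespace);

    // Escaped '#': the whole encoded tag has to match.
    public static readonly Regex Escaped = new Regex(@"
        ^&\#60;b&\#62;$",
        RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        string script = "&#60;script&#62;";
        string bold = "&#60;b&#62;";

        Console.WriteLine(Unescaped.IsMatch(script)); // True  (matches too much)
        Console.WriteLine(Escaped.IsMatch(script));   // False
        Console.WriteLine(Escaped.IsMatch(bold));     // True
    }
}
```

The _tags regex in the question is not affected, because it is not compiled with IgnorePatternWhitespace.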
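Answer 3 (score: 0)
I'm on a Mac, so I can't test your C# code. But it seems to me that you should make the _whitelist regex work on tag names only. That might mean you have to make two passes, one for opening tags and one for closing tags. But it would make things a lot simpler.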
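One way to read this suggestion is sketched below, under some assumptions: the class TagNameWhitelist, the method IsAllowed, and the exact pattern are hypothetical, and the sketch handles opening and closing tags in a single pass by making the encoded slash optional. It extracts the tag name from an encoded tag and looks it up in a set, so a name has to match exactly. A tag such as applet, which the ^&\#60;a(?!&\#62;).+?&\#62;$ branch of the whitelist regex would accept, is rejected here. Tags with attributes are not handled at all in this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagNameWhitelist
{
    // Matches one encoded tag without attributes and captures its name.
    // Escaping '#' is not required without IgnorePatternWhitespace, but it is harmless.
    private static readonly Regex EncodedTag = new Regex(
        @"^&\#60;(&\#47;)?(?<name>[a-z][a-z0-9]*)\s?(&\#47;)?&\#62;$",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Tag names taken from the whitelist regex in the question.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "b", "blockquote", "br", "code", "em", "h1", "h2", "h3", "hr",
            "i", "li", "ol", "p", "pre", "s", "sub", "sup", "strong", "strike", "ul"
        };

    public static bool IsAllowed(string encodedTag)
    {
        Match m = EncodedTag.Match(encodedTag);
        return m.Success && Allowed.Contains(m.Groups["name"].Value);
    }

    public static void Main()
    {
        Console.WriteLine(IsAllowed("&#60;b&#62;"));       // True
        Console.WriteLine(IsAllowed("&#60;&#47;B&#62;"));  // True
        Console.WriteLine(IsAllowed("&#60;script&#62;"));  // False
        Console.WriteLine(IsAllowed("&#60;applet&#62;"));  // False
    }
}
```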
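Answer 4 (score: 0)
In case anyone is interested in using this code, I am posting the complete code here (slightly refactored and with updated comments).
I also decided to remove the img tag from the whitelist, since @Pez and @some pointed out that it can be dangerous.
It must also be pointed out that I have not properly tested this against possible XSS attacks. It is only meant to illustrate how the approach works.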
using System.Text.RegularExpressions;
using System.Web;

class HtmlSanitizer
{
    /// <summary>
    /// A regex that matches things that look like an HTML tag after HtmlEncoding to &#DECIMAL; notation.
    /// Microsoft AntiXSS 3.0 can be used to perform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either end of line or &#62;
    /// </summary>
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// A regex that will match tags on the whitelist, so we can run them through
    /// HttpUtility.HtmlDecode
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of
    /// an a/link tag (i.e. in the text in between the opening and closing tag)
    /// </summary>
    private static readonly Regex _whitelist = new Regex(@"
        ^&\#60;(&\#47;)?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
        |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$
        |^&\#60;a(?!&\#62;).+?&\#62;$",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
        RegexOptions.ExplicitCapture | RegexOptions.Compiled);

    /// <summary>
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using
    /// a whitelist based approach, leaving the dangerous tags as encoded HTML tags
    /// </summary>
    public static string Sanitize(string html)
    {
        Match tag;
        MatchCollection tags = _tags.Matches(html);

        // iterate through all HTML tags in the input, last match first,
        // so that earlier match indexes stay valid while we edit the string
        for (int i = tags.Count - 1; i > -1; i--)
        {
            tag = tags[i];
            string tagname = tag.Value.ToLowerInvariant();

            if (_whitelist.IsMatch(tagname))
            {
                // If we find a tag on the whitelist, run it through
                // HtmlDecode, and re-insert it into the text
                string safeHtml = HttpUtility.HtmlDecode(tag.Value);
                html = html.Remove(tag.Index, tag.Length);
                html = html.Insert(tag.Index, safeHtml);
            }
        }

        return html;
    }
}
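For anyone who wants to try the approach outside an ASP.NET project, here is a compact, self-contained sketch run against the test input from the question. It is a variation, not the code above: the class name SanitizerSketch is hypothetical, the Remove/Insert loop is swapped for Regex.Replace with a match evaluator, System.Net.WebUtility.HtmlDecode is assumed as a stand-in for HttpUtility.HtmlDecode, and the a-with-attributes branch of the whitelist is left out.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SanitizerSketch
{
    // Splits the encoded input into chunks that look like encoded tags.
    private static readonly Regex Tags = new Regex(
        @"&\#60;(?!&\#62;).+?(&\#62;|$)",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture);

    // Whitelist of encoded tags without attributes; '#' is escaped because
    // IgnorePatternWhitespace would otherwise treat it as a comment marker.
    private static readonly Regex Whitelist = new Regex(@"
        ^&\#60;(&\#47;)?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$
        |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace |
        RegexOptions.ExplicitCapture);

    public static string Sanitize(string encodedHtml)
    {
        // Decode whitelisted tags; leave everything else encoded.
        return Tags.Replace(encodedHtml, m =>
            Whitelist.IsMatch(m.Value.ToLowerInvariant())
                ? WebUtility.HtmlDecode(m.Value)
                : m.Value);
    }

    public static void Main()
    {
        string encoded =
            "&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;" +
            "alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;" +
            "&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;";

        // The p and b tags are decoded; the script tags stay encoded.
        Console.WriteLine(Sanitize(encoded));
    }
}
```

Using a match evaluator avoids editing the string by index, so the order in which matches are processed no longer matters.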