Question

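I'm writing an ASP.NET MVC application. Some of the HTML comes from users and some from third-party sources. Is there a simple and fast way to clean HTML without bringing in heavy artillery like HAP (Html Agility Pack) or Tidy?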

I only need to remove scripts, styles, <object>/<embed>, href="javascript:", style=, and onclick. I don't think stripping them by hand with .Remove / .Replace is a good approach, even with a StringBuilder.

For example, given the following input:

<html>
    <style src="http://harmyourpage.com"/>
    <script src="http://killyourdog.com"/>
    <div>
        <a href="http://co.com">Good link</a>
        <a href="javascript::harm()">Bad link</a>
        <p>Some text <b>to</b> test</p><br/>
        <h1 style="position:absolute;">Damage your layout</h1>
        And an image there <img src="http://co.com/a.jpg"/><br>
        <span onclick="harm()">Good span with bad attribute</span>
        <object>Your lovely java can be there</object>
    </div>
</html>

it must be converted to this:

<div>
    <a href="http://co.com">Good link</a>
    <a>Bad link</a>
    <p>Some text <b>to</b> test</p><br/>
    <h1>Damage your layout</h1>
    And an image there <img src="http://co.com/a.jpg"/><br>
    <span>Good span with bad attribute</span>
</div>

So how do I do this the right way, using a whitelist of tags and attributes?

UPD: I tried the StackExchange HtmlHelpers library, but it removes tags I need, such as div, a, and img.

Answer 1

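The quickest way to get most of the way there is a regular expression that removes paired <script>, <style>, and <object> elements together with their content: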

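using System.Text.RegularExpressions;

// Matches paired <script>, <style>, and <object> elements, including their content.
// Singleline lets "." span line breaks; IgnoreCase also catches <SCRIPT>.
var regex = new Regex(
   "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)|(\\<object(.+?)\\</object\\>)",
   RegexOptions.Singleline | RegexOptions.IgnoreCase
);

string output = regex.Replace(input, "");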
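The pattern above only handles paired elements, while the sample input in the question uses self-closing <script .../> and <style .../> and also needs attributes stripped. Here is a rough sketch of how the same idea might be extended. The class and method names (RegexHtmlCleaner, Clean) are made up for illustration, and it assumes reasonably well-formed markup with no ">" inside attribute values. Regex-based cleaning is easy to bypass with malformed HTML, so treat it as a convenience rather than a security boundary.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class RegexHtmlCleaner
{
    private const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Paired or self-closing <script>, <style>, <object>, <embed> elements, with content.
    private static readonly Regex BlockedElements = new Regex(
        @"<(script|style|object|embed)\b[^>]*?(/>|>.*?</\1\s*>)", Options);

    // Any opening tag; attributes are only rewritten inside these matches.
    private static readonly Regex Tag = new Regex(@"<[a-z][^>]*>", Options);

    // Inline event handlers (onclick, onload, ...) and style attributes.
    private static readonly Regex BlockedAttributes = new Regex(
        @"\s+(on\w+|style)\s*=\s*(""[^""]*""|'[^']*'|[^\s>]+)", Options);

    // href/src attributes whose value starts with "javascript:".
    private static readonly Regex JavascriptUrls = new Regex(
        @"\s+(href|src)\s*=\s*(""\s*javascript:[^""]*""|'\s*javascript:[^']*'|javascript:[^\s>]+)",
        Options);

    public static string Clean(string html)
    {
        // Step 1: drop blocked elements entirely.
        string withoutElements = BlockedElements.Replace(html, "");

        // Step 2: strip blocked attributes, but only inside tags, not in text content.
        return Tag.Replace(withoutElements, m =>
            JavascriptUrls.Replace(BlockedAttributes.Replace(m.Value, ""), ""));
    }
}
```

Calling RegexHtmlCleaner.Clean(input) on the sample from the question should turn the bad link into <a>Bad link</a> and drop the style and onclick attributes. It leaves the outer <html> wrapper in place, because this is still a blacklist and not the whitelist the question asks for.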

You can also use the Microsoft Web Protection Library (http://wpl.codeplex.com/), for example:

Sanitizer.GetSafeHtmlFragment(input);
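Since the question asks specifically for a whitelist of tags and attributes, here is a minimal sketch of what that could look like with plain regular expressions. Everything in it is an assumption made for illustration: the WhitelistHtmlCleaner name, the allowed tag list (taken from the expected output in the question), and the rule that href and src are kept only when they point to http or https. It recognizes only quoted attribute values and will not cope with badly malformed markup, so a real parser is the safer choice for untrusted input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class WhitelistHtmlCleaner
{
    private const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;
    private static readonly string[] None = new string[0];

    // Assumed whitelist: tag name -> attributes kept on that tag.
    private static readonly Dictionary<string, string[]> Allowed =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "div", None }, { "p", None }, { "b", None }, { "br", None },
            { "h1", None }, { "span", None },
            { "a", new[] { "href" } }, { "img", new[] { "src" } }
        };

    // Elements that are removed together with their content.
    private static readonly Regex Dangerous = new Regex(
        @"<(script|style|object|embed)\b[^>]*?(/>|>.*?</\1\s*>)", Options);

    // Any opening, closing, or self-closing tag.
    private static readonly Regex Tag = new Regex(
        @"<(?<close>/?)(?<name>[a-z][a-z0-9]*)(?<attrs>[^>]*?)(?<self>/?)>", Options);

    // Only quoted attribute values are recognized; unquoted ones are dropped.
    private static readonly Regex Attr = new Regex(
        @"(?<name>[a-z][\w-]*)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)')", Options);

    public static string Clean(string html)
    {
        return Tag.Replace(Dangerous.Replace(html, ""), RewriteTag);
    }

    private static string RewriteTag(Match m)
    {
        string name = m.Groups["name"].Value;
        string[] allowedAttrs;
        // Tags outside the whitelist are dropped; their inner text is kept.
        if (!Allowed.TryGetValue(name, out allowedAttrs)) return "";
        if (m.Groups["close"].Value == "/") return "</" + name + ">";

        // Rebuild the tag from scratch with whitelisted attributes only.
        var sb = new StringBuilder("<" + name);
        foreach (Match a in Attr.Matches(m.Groups["attrs"].Value))
        {
            string attr = a.Groups["name"].Value;
            string value = a.Groups["value"].Value;
            bool allowed = Array.Exists(allowedAttrs,
                x => string.Equals(x, attr, StringComparison.OrdinalIgnoreCase));
            // The only whitelisted attributes here are URLs, so require http(s).
            if (allowed && IsHttpUrl(value))
                sb.Append(' ').Append(attr).Append("=\"")
                  .Append(value.Replace("\"", "&quot;")).Append('"');
        }
        return sb.Append(m.Groups["self"].Value).Append('>').ToString();
    }

    private static bool IsHttpUrl(string value)
    {
        return value.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
            || value.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
    }
}
```

Rebuilding each tag from the whitelist, instead of deleting known-bad pieces, means an attribute or tag nobody thought of is dropped by default. Unlike a blacklist, this also removes the outer <html> wrapper from the sample input.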
