Question

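I am using GetSafeHtmlFragment on my website and found that every tag except <p> and <a> is removed.

I looked into it and found that Microsoft has not fixed this.

Is there a replacement for it, or any workaround?

Thanks.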

Answer 1

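It is amazing that in version 4.2.1 Microsoft badly overcompensated for a security leak in version 4.2 of the XSS library, and a year later still has not released an update. As I read someone comment somewhere, the GetSafeHtmlFragment method should have been renamed to StripHtml.

I ended up using the HtmlSanitizer library, which was suggested in a related SO question. I like that it is available as a package through NuGet.

This library basically implements a variation of the whitelist approach used by the now accepted answer. However, it is based on CsQuery instead of the HTML Agility library. The package also offers some additional options, such as being able to keep style information (e.g. HTML attributes). Using this library resulted in code like the following in my project, which is at least a lot less code than the accepted answer :).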

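// Namespace as used by the HtmlSanitizer release this answer was written
// against; later releases use a different namespace (e.g. Ganss.XSS).
using System.Collections.Generic;
using Html;

...

var sanitizer = new HtmlSanitizer();
// Only the tags listed here are allowed through the sanitizer.
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);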

Answer 2

Another solution is to use the Html Agility Pack in conjunction with your own tag whitelist:

using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        // Node names that may stay in the document; any other node is removed.
        var whiteList = new[]
            {
                "#comment", "html", "head",
                "title", "body", "img", "p",
                "a"
            };
        var html = File.ReadAllText("input.html");
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // Collect the nodes first and remove them after the traversal, so the
        // tree is not modified while it is being enumerated.
        var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
        var e = doc
            .CreateNavigator()
            .SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
            .GetEnumerator();
        while (e.MoveNext())
        {
            var node =
                ((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
                .CurrentNode;
            if (!whiteList.Contains(node.Name))
            {
                nodesToRemove.Add(node);
            }
        }
        // Removing a node also removes everything nested inside it.
        nodesToRemove.ForEach(node => node.Remove());
        var sb = new StringBuilder();
        using (var w = new StringWriter(sb))
        {
            doc.Save(w);
        }
        Console.WriteLine(sb.ToString());
    }
}
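
The whitelist above is applied to node names only, so attributes on an allowed tag (an onclick on a <p>, for instance) pass through unchanged. Below is a minimal sketch of a variant that also filters attributes, assuming Html Agility Pack is referenced. The class name WhitelistSketch, the Clean method and both whitelists are hypothetical and chosen only for illustration. It is not a complete XSS defense: it does not, for example, check href or src values for javascript: URLs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class WhitelistSketch
{
    // Hypothetical whitelists for illustration; adjust them to your own needs.
    static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "p", "a", "img", "br" };
    static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "src", "alt" };

    public static string Clean(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the elements first, because the tree is modified below.
        var elements = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (var node in elements)
        {
            if (!AllowedTags.Contains(node.Name))
            {
                // Drops the element together with everything nested inside it.
                node.Remove();
                continue;
            }

            // Copy the attribute list before removing entries from it.
            foreach (var attr in node.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attr.Name))
                {
                    attr.Remove();
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling WhitelistSketch.Clean on a fragment returns the markup with disallowed elements and attributes dropped. Text and comment nodes are left alone here, because only element nodes are inspected.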

GetSafeHtmlFragment removes all HTML tags
