Question

I have a problem with stripping HTML and displaying the result as custom-formatted text.

For example:

asdas<br/>asdas

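So the tag should be replaced by a margin (here, a line break). But I also need to replace margins with spaces and tabs and remove all tags. Are there any examples or finished solutions for getting formatted text after the HTML markup has been removed?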

Current solution (looking for something better and more complete):

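using System;
using System.Text.RegularExpressions;
using System.Web;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripAllTagsRegex(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        // Strip the tags first, then decode entities such as &amp;.
        // Calling HtmlEncode first would escape '<' and '>', so the regex would never match.
        return HttpUtility.HtmlDecode(_htmlRegex.Replace(source, string.Empty));
    }

    /// <summary>
    /// Replace line-breaking tags with newlines; all other tags are left in place.
    /// </summary>
    public static string ChangeTagsToTextFormat(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        // Replace on the raw markup; encoding it first would hide the tags.
        return source.Replace("<br/>", Environment.NewLine)
            .Replace("</div>", Environment.NewLine)
            .Replace("</p>", Environment.NewLine);
    }
}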

Answer 1

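I believe the HTML Agility Pack is the simplest solution, especially since you are stripping (possibly malformed) HTML tags. The idea behind the code below is that you simply grab all the nodes and return their InnerText together with a line break ("\n", or whatever formatting you want, since after calling SelectNodes you have a collection to work with):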

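    // requires: using System.Text; using HtmlAgilityPack;
    private string stripTags(string html)
    {
        var output = new StringBuilder();
        var doc = new HtmlDocument();

        doc.LoadHtml(html);

        // Select text nodes only: "//*" would select every element, so the text
        // of nested elements would be emitted once per ancestor.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null) // SelectNodes returns null when nothing matches
            return string.Empty;

        foreach (HtmlNode node in nodes)
        {
            output.AppendLine(node.InnerText);
        }

        return output.ToString();
    }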

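To get more specifically formatted results, just call SelectNodes with a different XPath expression. (The code given here has not actually been tested, so you may want to be more precise.)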
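As an illustration of tag-specific formatting, here is a minimal sketch that walks the parsed tree recursively instead of using a single XPath query. The class name `HtmlToText`, the method names, and the choice of which tags end a line or get a tab are assumptions made for this example, not something prescribed by the HTML Agility Pack; like the code above, treat it as a starting point rather than a tested implementation.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper: converts HTML to plain text while keeping basic layout.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        var output = new StringBuilder();
        Walk(doc.DocumentNode, output);
        return output.ToString().Trim();
    }

    private static void Walk(HtmlNode node, StringBuilder output)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                return; // comments carry no visible text
            case HtmlNodeType.Text:
                // Decode entities such as &amp; into plain characters.
                output.Append(HtmlEntity.DeEntitize(node.InnerText));
                return;
        }

        // Skip content that is never rendered as text.
        if (node.Name == "script" || node.Name == "style")
            return;

        // Assumed convention: indent list items with a tab.
        if (node.Name == "li")
            output.Append('\t');

        foreach (HtmlNode child in node.ChildNodes)
            Walk(child, output);

        // Assumed convention: these tags end a line.
        if (node.Name == "br" || node.Name == "p" || node.Name == "div" || node.Name == "li")
            output.Append(Environment.NewLine);
    }
}
```

With the input from the question, `HtmlToText.Convert("asdas<br/>asdas")` should give the two words on separate lines. The tag lists are the place to adjust if you want tabs or spaces for other elements.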

Answer 2

Don't use regular expressions to parse HTML

Use something like the HTML Agility Pack. Here's an introduction to how to use it
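To make the advice concrete, here is a small sketch of how the `<.*?>` pattern from the question can go wrong on input that browsers would still display. The class and method names are made up for this example, and both inputs are contrived.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfall
{
    // Naive tag stripping, using the same pattern as in the question.
    public static string NaiveStrip(string html)
    {
        return Regex.Replace(html, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the text.
        Console.WriteLine(NaiveStrip("<a title=\"a > b\">link</a>"));
        // prints: ' b">link' (without the single quotes)

        // A stray '<' in the text is treated as the start of a tag,
        // so real content is swallowed up to the next '>'.
        Console.WriteLine(NaiveStrip("<p>if a < b then</p>"));
        // prints: 'if a ' (without the single quotes)
    }
}
```

An HTML parser does not rely on a single pattern to find where a tag starts and ends, which is why it copes better with markup like this.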

Answer 3

If you are using Microsoft SharePoint, you can use SPHttpUtility.

Example:

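using Microsoft.SharePoint;
using Microsoft.SharePoint.Utilities; // namespace that contains SPHttpUtility
using NUnit.Framework;                // assumed test framework ([Test], Assert.That)

[Test]
public void RemoveHtml()
{
    string textWithHtml = "<div class='ExternalCla48D45'>value</div>";
    // Pass the variable declared above (the original referenced an undefined name).
    textWithHtml = SPHttpUtility.ConvertSimpleHtmlToText(textWithHtml, -1);
    Assert.That(textWithHtml, Is.EqualTo("value"));
}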

It works very well for multi-line fields.

Using .NET to remove all HTML tags and format the text with line breaks, spaces, etc.
