Question

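I've looked around a lot but can't find a built-in .NET method that escapes only the special XML characters <, >, &, ' and " when they are not part of a tag.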

For example, take the following text:

Test& <b>bold</b> <i>italic</i> <<Tag index="0" />

I want it converted to:

Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />

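Note that the tags are not escaped. I basically need to set this value as the InnerXml of an XmlElement, so those tags must be preserved.

I've looked into implementing my own parser and using a StringBuilder to optimize it as much as I can, but it could get pretty nasty.

I also know which tags are acceptable, which may simplify things (only: br, b, i, u, blink, flash, Tag). In addition, these tags can be self-closing tags (e.g. <u />) or container tags.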

Answer 1

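Note: this could probably be optimized. It's just something I knocked up quickly for you. Also note that I'm not doing any validation of the tags themselves; it just looks for anything wrapped in angle brackets. It will also fail if an angle bracket is found inside a tag (e.g. <sometag label="I put an > here">). Apart from that, I think it should do what you want.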

namespace ConsoleApplication1
{
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // This is the test string.
            const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

            // Do a regular expression search and replace. We're looking for a complete tag (which will be ignored) or
            // a character that needs escaping.
            string result = Regex.Replace(testString, @"(?'Tag'\<{1}[^\>\<]*[\>]{1})|(?'Ampy'\&[A-Za-z0-9]+;)|(?'Special'[\<\>\""\'\&])", (match) =>
                {
                    // If a special (escapable) character was found, replace it.
                    if (match.Groups["Special"].Success)
                    {
                        switch (match.Groups["Special"].Value)
                        {
                            case "<":
                                return "&lt;";
                            case ">":
                                return "&gt;";
                            case "\"":
                                return "&quot;";
                            case "\'":
                                return "&apos;";
                            case "&":
                                return "&amp;";
                            default:
                                return match.Groups["Special"].Value;
                        }
                    }

                    // Otherwise, just return what was found.
                    return match.Value;
                });

            // Show the result.
            Console.WriteLine("Test String: " + testString);
            Console.WriteLine("Result     : " + result);
            Console.ReadKey();
        }
    }
}

Answer 2

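Personally, I don't think this is possible, because you are really trying to fix malformed HTML, so there are no rules you can use to determine what should be encoded and what shouldn't.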

Any way you look at it, something like <<Tag index="0" /> is not valid HTML.

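If you know the actual tags, you could create a whitelist, which would simplify things, but you will have to attack your problem more specifically; I don't think you can solve this for every case.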

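In fact, chances are you don't have random < or > characters lying around in your text, which would (probably) simplify the problem considerably, but if you are really trying to come up with a generic solution... I wish you luck.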
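
The whitelist idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a general fix for malformed markup: the tag names come from the question, while the class name WhitelistEscaper, the Token pattern, and the choice to escape every & (including ones that already start an entity) are hypothetical choices made here for brevity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistEscaper
{
    // Tag names taken from the question; matching is case-sensitive.
    private const string AllowedNames = "br|blink|flash|Tag|b|i|u";

    // Either a complete whitelisted tag (opening, closing, or self-closing,
    // optionally with attributes) or a single character that needs escaping.
    private static readonly Regex Token = new Regex(
        @"(?<tag></?(?:" + AllowedNames + @")(?:\s[^<>]*)?/?>)" +
        @"|(?<special>[<>&""'])");

    public static string Escape(string input)
    {
        return Token.Replace(input, m =>
        {
            // Whitelisted tags pass through untouched.
            if (m.Groups["tag"].Success) return m.Value;

            // Assumption: every special character outside a whitelisted tag
            // is escaped, so an existing "&amp;" would be escaped again.
            switch (m.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "&": return "&amp;";
                case "\"": return "&quot;";
                default: return "&apos;";
            }
        });
    }
}
```

Calling WhitelistEscaper.Escape on the question's sample text leaves the b, i and Tag elements intact and escapes only the stray & and <. A tag that is not on the list, such as <script>, is escaped like any other text. Like the regex in Answer 1, this sketch still breaks if an attribute value contains > or <.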

Answer 3

Here is a regular expression you can use that will match any invalid < or >.

(\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>)

I would suggest putting the valid tag-test expression into a variable and then building the rest around it.

var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
var startTag = @"\<(?! ?/?(?:" + validTags + "))";
var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";

Then just do a Regex.Replace with these.
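
A minimal sketch of that last step, assuming the two halves of the combined expression are applied one after the other. The class name BracketEscaper and the field names are hypothetical. Note that this approach touches only angle brackets (& and quotes are left alone), and that the lookahead only checks that a valid tag name starts right after the bracket, so something like <bold> would not be escaped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketEscaper
{
    // Same tag-test expression as in the answer.
    private const string ValidTags = "b|i|br|u|blink|flash|Tag[^>]*";

    // A '<' that is not followed by one of the valid tag names.
    private static readonly Regex StrayOpen =
        new Regex(@"\<(?! ?/?(?:" + ValidTags + "))");

    // A '>' that is not preceded by one of the valid tag names.
    private static readonly Regex StrayClose =
        new Regex(@"(?<! ?/?(?:" + ValidTags + @"))\>");

    public static string Escape(string input)
    {
        // Escape stray opening brackets first, then stray closing brackets.
        string result = StrayOpen.Replace(input, "&lt;");
        return StrayClose.Replace(result, "&gt;");
    }
}
```

On the question's sample text this escapes only the extra < in front of <Tag index="0" />; the & would still need to be handled separately.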

Conditionally escaping special XML characters
