Question

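I'm looking for a generic method in .Net to encode a string for use in an Xml element or attribute, and was surprised when I didn't immediately find one. So, before I go too much further, could I just be missing the built-in function?

Assuming for a moment that it really doesn't exist, I'm putting together my own generic EncodeForXml(string data) method, and I'm thinking about the best way to do this.

The data I'm using that prompted this whole thing could contain bad characters like &, <, ", etc. It could also occasionally contain properly escaped entities: &amp;, &lt;, and &quot;, which means just using a CDATA section may not be the best idea. That seems kind of clunky anyway; I'd much rather end up with a nice string value that can be used directly in the xml.

I've used a regular expression in the past to catch bad ampersands, and I'm thinking of using it as the first step here as well, then doing a simple replace for the other characters.

So, could this be optimized further without making it too complex, and is there anything I'm missing?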

Function EncodeForXml(ByVal data As String) As String
    Static badAmpersand As new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)")

    data = badAmpersand.Replace(data, "&amp;")

    return data.Replace("<", "&lt;").Replace("""", "&quot;").Replace(">", "&gt;")
End Function

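Sorry to all you C#-only folks. I don't really care which language I use, but I wanted to make the Regex static, and you can't do that in C# without declaring it outside the method, so this will be VB.Net.

Finally, we're still on .Net 2.0 where I work, but if someone could take the final product and turn it into an extension method for the string class, that would be pretty cool too.

Update: The first few responses indicate that .Net does indeed already have built-in ways of doing this. But now that I've started, I kind of want to finish my EncodeForXml() method just for the fun of it, so I'm still looking for ideas for improvement. Notably: a more complete list of characters that should be encoded as entities (perhaps stored in a list/map), and something that performs better than calling .Replace() on immutable strings in series.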
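As a rough illustration of the two wishes above (an extension method, and a single pass instead of chained .Replace() calls), here is a minimal C# sketch. The class name XmlStringExtensions is made up for this example, the regex is copied from the VB code in the question, and extension methods need the C# 3.0 compiler or later. It only covers the five predefined entities and does nothing about characters that are illegal in XML.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class XmlStringExtensions // hypothetical name, for illustration only
{
    // Same pattern as the question: an ampersand that does not already start an entity.
    // Declared as a static field because C# has no method-level statics.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)", RegexOptions.Compiled);

    public static string EncodeForXml(this string data)
    {
        if (string.IsNullOrEmpty(data)) return string.Empty;

        // Step 1: escape bare ampersands, leaving existing entities alone.
        data = BadAmpersand.Replace(data, "&amp;");

        // Step 2: one pass over the characters, appending to a StringBuilder
        // rather than allocating a new string for every Replace call.
        var sb = new StringBuilder(data.Length + 16);
        foreach (char c in data)
        {
            switch (c)
            {
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                case '\'': sb.Append("&apos;"); break;
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, "a & b &amp; <c>".EncodeForXml() yields "a &amp; b &amp; &lt;c&gt;": the bare ampersand is escaped while the existing entity is left untouched.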

Answer 1

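Depending on how much you know about the input, you may have to take into account that not all Unicode characters are valid XML characters.

Both Server.HtmlEncode and System.Security.SecurityElement.Escape seem to ignore illegal XML characters, while System.XML.XmlWriter.WriteString throws an ArgumentException when it encounters illegal characters (unless you disable that check, in which case it ignores them). An overview of library functions is available here.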
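The difference described above can be checked with a small C# sketch. The class and method names are invented for this illustration, and the behaviour shown is what the paragraph above describes for an XmlWriter created with default settings: the U+001D control character is expected to be rejected, while SecurityElement.Escape passes it through unchanged.

```csharp
using System;
using System.Security;
using System.Text;
using System.Xml;

public static class IllegalCharDemo // hypothetical name, for illustration only
{
    // Returns true if an XmlWriter with default settings refuses to write the text.
    public static bool WriterRejects(string text)
    {
        var sb = new StringBuilder();
        try
        {
            using (XmlWriter writer = XmlWriter.Create(sb))
            {
                writer.WriteElementString("r", text);
            }
            return false;
        }
        catch (ArgumentException)
        {
            // Thrown for characters that are not legal in XML 1.0, such as U+001D.
            return true;
        }
    }

    // SecurityElement.Escape only touches <, >, ", ' and &; everything else is left as is.
    public static string EscapeOnly(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

Under those assumptions, IllegalCharDemo.WriterRejects("a\u001Db") should return true, while IllegalCharDemo.EscapeOnly("a\u001Db") returns the input unchanged, illegal character included.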

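Edit 2011/8/14: seeing that at least a few people have consulted this answer in the last couple of years, I decided to completely rewrite the original code, which had numerous issues, including horribly mishandling UTF-16.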

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

/// <summary>
/// Encodes data so that it can be safely embedded as text in XML documents.
/// </summary>
public class XmlTextEncoder : TextReader {
    public static string Encode(string s) {
        using (var stream = new StringReader(s))
        using (var encoder = new XmlTextEncoder(stream)) {
            return encoder.ReadToEnd();
        }
    }

    /// <param name="source">The data to be encoded in UTF-16 format.</param>
    /// <param name="filterIllegalChars">It is illegal to encode certain
    /// characters in XML. If true, silently omit these characters from the
    /// output; if false, throw an error when encountered.</param>
    public XmlTextEncoder(TextReader source, bool filterIllegalChars=true) {
        _source = source;
        _filterIllegalChars = filterIllegalChars;
    }

    readonly Queue<char> _buf = new Queue<char>();
    readonly bool _filterIllegalChars;
    readonly TextReader _source;

    public override int Peek() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Peek();
    }

    public override int Read() {
        PopulateBuffer();
        if (_buf.Count == 0) return -1;
        return _buf.Dequeue();
    }

    void PopulateBuffer() {
        const int endSentinel = -1;
        while (_buf.Count == 0 && _source.Peek() != endSentinel) {
            // Strings in .NET are assumed to be UTF-16 encoded [1].
            var c = (char) _source.Read();
            if (Entities.ContainsKey(c)) {
                // Encode all entities defined in the XML spec [2].
                foreach (var i in Entities[c]) _buf.Enqueue(i);
            } else if (!(0x0 <= c && c <= 0x8) &&
                       !new[] { 0xB, 0xC }.Contains(c) &&
                       !(0xE <= c && c <= 0x1F) &&
                       !(0x7F <= c && c <= 0x84) &&
                       !(0x86 <= c && c <= 0x9F) &&
                       !(0xD800 <= c && c <= 0xDFFF) &&
                       !new[] { 0xFFFE, 0xFFFF }.Contains(c)) {
                // Allow if the Unicode codepoint is legal in XML [3].
                _buf.Enqueue(c);
            } else if (char.IsHighSurrogate(c) &&
                       _source.Peek() != endSentinel &&
                       char.IsLowSurrogate((char) _source.Peek())) {
                // Allow well-formed surrogate pairs [1].
                _buf.Enqueue(c);
                _buf.Enqueue((char) _source.Read());
            } else if (!_filterIllegalChars) {
                // Note that we cannot encode illegal characters as entity
                // references due to the "Legal Character" constraint of
                // XML [4]. Nor are they allowed in CDATA sections [5].
                throw new ArgumentException(
                    String.Format("Illegal character: '{0:X}'", (int) c));
            }
        }
    }

    static readonly Dictionary<char,string> Entities =
        new Dictionary<char,string> {
            { '"', "&quot;" }, { '&', "&amp;"}, { '\'', "&apos;" },
            { '<', "&lt;" }, { '>', "&gt;" },
        };

    // References:
    // [1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
    // [2] http://www.w3.org/TR/xml11/#sec-predefined-ent
    // [3] http://www.w3.org/TR/xml11/#charsets
    // [4] http://www.w3.org/TR/xml11/#sec-references
    // [5] http://www.w3.org/TR/xml11/#sec-cdata-sect
}

Unit tests and full code can be found here.

Answer 2

SecurityElement.Escape

Documented here
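A minimal usage sketch in C#; the wrapper class EscapeDemo is made up for this example, and the only library call is System.Security.SecurityElement.Escape.

```csharp
using System.Security;

public static class EscapeDemo // hypothetical name, for illustration only
{
    public static string Escape(string text)
    {
        // Replaces <, >, ", ' and & with the five predefined XML entities.
        return SecurityElement.Escape(text);
    }
}
```

EscapeDemo.Escape("<a href=\"x\">Tom & 'Jerry'</a>") returns "&lt;a href=&quot;x&quot;&gt;Tom &amp; &apos;Jerry&apos;&lt;/a&gt;". Note that it does not recognise entities that are already escaped: "&amp;" comes back as "&amp;amp;", which matters for the mixed input described in the question.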

Answer 3

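In the past I have used HttpUtility.HtmlEncode to encode text for xml. It performs the same task, really. I haven't run into any issues with it yet, but that's not to say I won't in the future. As the name implies, it was made for HTML, not XML.

You've probably already read it, but here is an article on xml encoding and decoding.

EDIT: Of course, if you use an XmlWriter or one of the new XElement classes, this encoding is done for you. In fact, you could just take the text, place it in a new XElement instance, then return the string (.ToString) version of the element. I've heard that SecurityElement.Escape will perform the same task as your utility method as well, but I haven't read much about it or used it.

EDIT2: Disregard my comment about XElement, since you're still on 2.0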
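For readers on .NET 3.5 or later, the XElement idea from the first edit might look roughly like this in C#. The class name and the element name "value" are arbitrary choices for the sketch.

```csharp
using System.Xml.Linq;

public static class XElementDemo // hypothetical name, for illustration only
{
    public static string Wrap(string text)
    {
        // XElement escapes the text content when the element is serialized.
        return new XElement("value", text).ToString();
    }
}
```

XElementDemo.Wrap("a < b & c") returns "<value>a &lt; b &amp; c</value>", so the wrapper tags still have to be stripped if only the escaped text is wanted.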

Answer 4

Microsoft's ~~AntiXss library~~ AntiXssEncoder Class in System.Web.dll has methods for this:

AntiXss.XmlEncode(string s)
AntiXss.XmlAttributeEncode(string s)

It has HTML as well:

AntiXss.HtmlEncode(string s)
AntiXss.HtmlAttributeEncode(string s)

Answer 5

~~In .net 3.5+~~

new XText("I <want> to & encode this for XML").ToString();

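Gives you:

I &lt;want&gt; to &amp; encode this for XML

It turns out that this method doesn't encode some things that it should (like quotes).

SecurityElement.Escape (workmad3's answer) seems to do a better job with this, and it's included in earlier versions of .net.

If you don't mind 3rd party code and want to ensure no illegal characters make it into your XML, I would recommend Michael Kropat's answer.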

Answer 6

XmlTextWriter.WriteString() does the escaping.
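A small C# sketch of how that could be used to escape a standalone string, writing into a StringWriter. The class name is invented for the example, and it relies on XmlTextWriter accepting text without an enclosing element.

```csharp
using System.IO;
using System.Xml;

public static class WriteStringDemo // hypothetical name, for illustration only
{
    public static string Escape(string text)
    {
        using (var sw = new StringWriter())
        {
            using (var writer = new XmlTextWriter(sw))
            {
                // WriteString escapes markup characters in text content.
                writer.WriteString(text);
            }
            return sw.ToString();
        }
    }
}
```

WriteStringDemo.Escape("a < b & c") gives "a &lt; b &amp; c".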

Answer 7

If this is an ASP.NET app, why not use Server.HtmlEncode()?

Answer 8

System.XML handles the encoding for you, so you don't need a method like this.

Answer 9

This might be a case where you could benefit from using the WriteCData method.

public override void WriteCData(string text)
    Member of System.Xml.XmlTextWriter

Summary:
Writes out a <![CDATA[...]]> block containing the specified text.

Parameters:
text: Text to place inside the CDATA block.

A simple example would look like the following:

writer.WriteStartElement("name");
writer.WriteCData("<unsafe characters>");
writer.WriteFullEndElement();

The result looks like:

<name><![CDATA[<unsafe characters>]]></name>

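When reading the node values, the XmlReader automatically strips out the CData part of the inner text, so you don't have to worry about it. The only catch is that you have to store the data as an innerText value of an XML node. In other words, you can't insert CData content into an attribute value.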
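To illustrate the reading side, here is a short C# sketch. It uses XmlDocument rather than a raw XmlReader purely for brevity, and the class name is made up for the example.

```csharp
using System.Xml;

public static class CDataDemo // hypothetical name, for illustration only
{
    public static string ReadRootText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // InnerText returns the CDATA content without the <![CDATA[ ]]> wrapper.
        return doc.DocumentElement.InnerText;
    }
}
```

CDataDemo.ReadRootText("<name><![CDATA[<unsafe characters>]]></name>") returns "<unsafe characters>". Keep in mind that a CDATA section cannot itself contain the sequence "]]>".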

Answer 10

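Brilliant! That's all I can say.

Here is a VB variant of the updated code (not in a class, just a function) that will clean up and also sanitize the xml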

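' Requires Imports System.Text and Imports System.Collections.Generic
Function cXML(ByVal _buf As String) As String
    ' Guard against Nothing before touching the string.
    If String.IsNullOrEmpty(_buf) Then Return String.Empty
    Dim textOut As New StringBuilder
    Dim c As Char
    For i As Integer = 0 To _buf.Length - 1
        c = _buf(i)
        If Entities.ContainsKey(c) Then
            textOut.Append(Entities.Item(c))
        ElseIf (AscW(c) = &H9 OrElse AscW(c) = &HA OrElse AscW(c) = &HD) OrElse ((AscW(c) >= &H20) AndAlso (AscW(c) <= &HD7FF)) _
            OrElse ((AscW(c) >= &HE000) AndAlso (AscW(c) <= &HFFFD)) Then
            textOut.Append(c)
        ElseIf Char.IsHighSurrogate(c) AndAlso i + 1 < _buf.Length AndAlso Char.IsLowSurrogate(_buf(i + 1)) Then
            ' A Char is a single UTF-16 code unit, so code points above &HFFFF
            ' arrive as surrogate pairs; keep well-formed pairs together.
            textOut.Append(c).Append(_buf(i + 1))
            i += 1
        End If
        ' Anything else is illegal in XML and is silently dropped.
    Next
    Return textOut.ToString

End Function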

Shared ReadOnly Entities As New Dictionary(Of Char, String)() From {{""""c, "&quot;"}, {"&"c, "&amp;"}, {"'"c, "&apos;"}, {"<"c, "&lt;"}, {">"c, "&gt;"}}

Answer 11

You can use the built-in class XAttribute, which handles the encoding automatically:

using System.Xml.Linq;

XDocument doc = new XDocument();

List<XAttribute> attributes = new List<XAttribute>();
attributes.Add(new XAttribute("key1", "val1&val11"));
attributes.Add(new XAttribute("key2", "val2"));

XElement elem = new XElement("test", attributes.ToArray());

doc.Add(elem);

string xmlStr = doc.ToString();

Answer 12

Here is a single-line solution using XElements. I use it in a very small tool. I don't need it a second time, so I keep it this way. (It's dirty, though.)

StrVal = (<x a=<%= StrVal %>>END</x>).ToString().Replace("<x a=""", "").Replace(">END</x>", "")

Oh, and it only works in VB, not C#

Answer 13

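If you are serious about handling all of the invalid characters (not just the few "html" ones), and you have access to System.Xml, here's the simplest way to do proper Xml encoding of value data: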

string theTextToEscape = "Something \x1d else \x1D <script>alert('123');</script>";
var x = new XmlDocument();
x.LoadXml("<r/>"); // simple, empty root element
x.DocumentElement.InnerText = theTextToEscape; // put in raw string
string escapedText = x.DocumentElement.InnerXml; // Returns:  Something &#x1D; else &#x1D; &lt;script&gt;alert('123');&lt;/script&gt;

// Repeat the last 2 lines to escape additional strings.

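It's important to know that XmlConvert.EncodeName() is not appropriate, because that's for entity/tag names, not values. Using it would be like Url-encoding when you needed to Html-encode.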
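A quick C# sketch shows why EncodeName is the wrong tool for values: it rewrites characters that are not allowed in XML names as _xHHHH_ sequences instead of entities. The wrapper class name is made up for the example.

```csharp
using System.Xml;

public static class EncodeNameDemo // hypothetical name, for illustration only
{
    public static string Encode(string name)
    {
        // Intended for element/attribute names, not for text or attribute values.
        return XmlConvert.EncodeName(name);
    }
}
```

EncodeNameDemo.Encode("first name") returns "first_x0020_name", and EncodeNameDemo.Encode("a<b") returns "a_x003C_b", neither of which is what you want inside element text.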

Best way to encode text data for XML