
Question

I'm currently working with some XML.

My nodes contain strings like this:

<node>This is a string</node>

Some of the strings that I pass to the node will contain characters like &, #, $, etc.

<node>This is a string & so is this</node>

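This is invalid because of the &.

I can't wrap these strings in CDATA, because they need to stay as they are. I have tried looking online for a list of characters that cannot be put in an XML node without being wrapped in CDATA.

Can anyone point me in the direction of one, or provide me with a list of illegal characters?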

Answer 1

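OK, let's separate the question into (1) characters that are not valid at all in any XML document, and (2) characters that need to be escaped:

The answer provided by @dolmen in "Invalid Characters in XML" is still valid, but it needs to be updated for the XML 1.1 specification.

1. Invalid characters

The characters described here are all the characters that are allowed to be inserted in an XML document.

1.1. In XML 1.0

Reference: see XML recommendation 1.0, §2.2 Characters

The global list of allowed characters is: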

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

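Basically, control characters (other than tab, line feed and carriage return) and characters outside the Unicode ranges above are not allowed. This also means that referring to such a character through a character reference, for example &#x3;, is forbidden.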
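As a rough illustration of the Char production above, here is a minimal Python sketch. The function names is_xml10_char and strip_invalid_xml10 are made up for this example and do not come from any library.

```python
import re

# Everything outside the XML 1.0 Char production, written as a negated class:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches the Char production."""
    return _INVALID_XML10.match(ch) is None


def strip_invalid_xml10(text: str) -> str:
    """Remove every character that XML 1.0 does not allow."""
    return _INVALID_XML10.sub("", text)
```

Stripping is lossy, so whether to drop, replace or reject such characters depends on the application.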

1.2. In XML 1.1

Reference: see XML recommendation 1.1, §2.2 Characters, and §1.3 Rationale and list of changes for XML 1.1

The global list of allowed characters is:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

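This revision of the XML recommendation extends the allowed characters so that control characters are allowed, and takes into account a newer revision of the Unicode standard, but these are still not allowed: NUL (x00), xFFFE, xFFFF... The characters listed in RestrictedChar may appear only as character references, not literally.

However, the use of control characters and undefined Unicode characters is discouraged.

Note also that not all parsers take this into account, and XML documents containing control characters may be rejected.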
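The two XML 1.1 productions can be read as a three-way classification of code points. The Python sketch below is only an illustration; classify_xml11 and its return labels are invented names.

```python
def classify_xml11(code_point: int) -> str:
    """Classify a code point against productions [2] and [2a] of XML 1.1."""
    in_char = (
        0x1 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )
    if not in_char:
        # NUL, surrogates, U+FFFE, U+FFFF and anything above U+10FFFF.
        return "forbidden"
    restricted = (
        0x1 <= code_point <= 0x8
        or 0xB <= code_point <= 0xC
        or 0xE <= code_point <= 0x1F
        or 0x7F <= code_point <= 0x84
        or 0x86 <= code_point <= 0x9F
    )
    # Restricted characters are legal only when written as character references.
    return "restricted" if restricted else "allowed"
```

For instance, the vertical tab (0x0B) comes out as "restricted", while tab (0x09) and NEL (0x85) come out as "allowed".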

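2. Characters that need to be escaped (to obtain a well-formed document):

The < must be escaped with the &lt; entity, since it is assumed to be the beginning of a tag.

The & must be escaped with the &amp; entity, since it is assumed to be the beginning of an entity reference.

The > should be escaped with the &gt; entity. It is not mandatory (it depends on the context), but it is strongly advised to escape it.

The ' should be escaped with the &apos; entity. This is mandatory in attributes delimited by single quotes, but it is strongly advised to always escape it.

The " should be escaped with the &quot; entity. This is mandatory in attributes delimited by double quotes, but it is strongly advised to always escape it.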

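The five escapes listed above can be sketched in Python with the standard library's xml.sax.saxutils.escape. The wrapper name escape_xml is made up for this example.

```python
from xml.sax.saxutils import escape


def escape_xml(value: str) -> str:
    """Escape the five characters that have predefined XML entities."""
    # escape() handles &, < and > on its own; the two quote characters
    # have to be requested through the extra mapping.
    return escape(value, {'"': "&quot;", "'": "&apos;"})
```

With the string from the question, escape_xml("This is a string & so is this") gives "This is a string &amp; so is this". Note that this only escapes; it does not remove characters that are invalid in XML altogether.
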
Answer 2

The list of valid characters is in the XML specification:

Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Answer 3

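The only illegal characters are &, < and > (as well as " or ' in attributes).

They are escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes the XML for you and abstracts this kind of thing away, so that you don't have to worry about it.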
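To illustrate letting a library do the escaping, here is a small Python sketch using the standard xml.etree.ElementTree module. The attribute name title is just an example value.

```python
import xml.etree.ElementTree as ET

# Build the node from the question; the library escapes the text on output.
node = ET.Element("node")
node.text = "This is a string & so is this"
node.set("title", 'a "quoted" <value>')

serialized = ET.tostring(node, encoding="unicode")

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(serialized)
```

The serialized form contains &amp; in place of the raw ampersand, and parsed.text is again the original string.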

Answer 4

Here is some C# code that removes the characters that are invalid in XML from a string and returns a new, valid string.

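using System.Text.RegularExpressions;

public static string CleanInvalidXmlChars(string text)
{
    // From the XML spec, valid chars are:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // .NET strings are UTF-16 and \u takes exactly four hex digits, so the last
    // range cannot be written as \u10000-\u10FFFF; those characters appear as
    // surrogate pairs. Match the invalid code units instead: C0 controls other
    // than tab/LF/CR, U+FFFE, U+FFFF, and unpaired surrogates.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
              + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
              + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}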

Answer 5

The predeclared characters are:

& < > " '

http://xml.silmaril.ie/specials.html

Answer 6

Another easy way to escape potentially unwanted XML / XHTML characters in C# is:

WebUtility.HtmlEncode(stringWithStrangeChars)

Answer 7

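In addition to potame's answer, here is what applies if you want to escape using a CDATA block.

If you put your text in a CDATA block, you don't need to use escaping. In that case you can use all characters in the range of valid XML characters (the Char production quoted in the other answers).

Note: on top of that, you are not allowed to use the ]]> character sequence, because it would match the end of the CDATA block.

If there are still invalid characters (e.g. control characters), it is probably better to use some kind of encoding (e.g. base64).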
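The ]]> restriction can be worked around by closing the section after ]] and reopening a new one before >. The Python sketch below shows that common trick; to_cdata is a made-up helper name, and the text is still assumed to contain only valid XML characters.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded "]]>"."""
    # "]]>" would close the section early, so end the section after "]]"
    # and start a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = "a < b && c ]]> d"
parsed = ET.fromstring("<node>" + to_cdata(original) + "</node>")
```

After parsing, parsed.text equals the original string, including the literal ]]>.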

Answer 8

Another way to remove incorrect XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0).

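The method below keeps only the characters for which XmlConvert.IsXmlChar returns true; the same call can also be used to check that every character of a string is XML-valid.

using System.Linq;

public static string RemoveInvalidXmlChars(string content)
{
   // XmlConvert.IsXmlChar(char) examines a single UTF-16 code unit, so surrogate
   // halves (and therefore characters above U+FFFF) are removed as well.
   return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
}

.NET Fiddle - https://dotnetfiddle.net/v1TNus

For example, the vertical tab character (\v) is not valid in XML: it is valid UTF-8 but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.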
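The same "write succeeds, read fails" behaviour can be reproduced with Python's standard xml.etree.ElementTree, which is used here only as an illustration (it is not one of the libraries named above). This sketch assumes the default expat-based parser that ships with CPython.

```python
import xml.etree.ElementTree as ET

node = ET.Element("node")
node.text = "before\x0bafter"  # contains a vertical tab (U+000B)

# The serializer does not check characters against the Char production,
# so the vertical tab is written out unchanged.
serialized = ET.tostring(node, encoding="unicode")

# The parser, however, rejects the document that was just produced.
try:
    ET.fromstring(serialized)
    reparsed_ok = True
except ET.ParseError:
    reparsed_ok = False
```

This is why filtering or rejecting invalid characters before serializing is safer than relying on the writer to complain.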

Answer 9

This worked for me:

// No commas inside the character class: a comma there is a literal and would be stripped too.
string code = Regex.Replace(item.Code, @"[\u0000-\u0008\u000B\u000C\u000E-\u001F]", "");

Details are in the linked blog post.

Answer 10

For Java folks, Apache has a utility class (StringEscapeUtils) with a helper method, escapeXml, that can be used to escape the characters in a string using XML entities.

Answer 11

In the Woodstox XML processor, invalid characters are classified by this code:

if (c == 0) {
    throw new IOException("Invalid null character in text to output");
}
if (c < ' ' || (c >= 0x7F && c <= 0x9F)) {
    String msg = "Invalid white space character (0x" + Integer.toHexString(c) + ") in text to output";
    if (mXml11) {
        msg += " (can only be output using character entity)";
    }
    throw new IOException(msg);
}
if (c > 0x10FFFF) {
    throw new IOException("Illegal unicode character point (0x" + Integer.toHexString(c) + ") to output; max is 0x10FFFF as per RFC");
}
/*
 * Surrogate pair in non-quotable (not text or attribute value) content, and non-unicode encoding (ISO-8859-x,
 * Ascii)?
 */
if (c >= SURR1_FIRST && c <= SURR2_LAST) {
    throw new IOException("Illegal surrogate pair -- can only be output via character entities, which are not allowed in this content");
}
throw new IOException("Invalid XML character (0x"+Integer.toHexString(c)+") in text to output");

Source: the Woodstox source code.

Answer 12

ampersand (&) is escaped to &amp;

double quotes (") are escaped to &quot;

single quotes (') are escaped to &apos; 

less than (<) is escaped to &lt; 

greater than (>) is escaped to &gt;

In C#, use System.Security.SecurityElement.Escape or System.Net.WebUtility.HtmlEncode to escape these illegal characters.

string xml = "<node>it's my \"node\" & i like it 0x12 x09 x0A  0x09 0x0A <node>";
string encodedXml1 = System.Security.SecurityElement.Escape(xml);
string encodedXml2= System.Net.WebUtility.HtmlEncode(xml);


encodedXml1
"&lt;node&gt;it&apos;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

encodedXml2
"&lt;node&gt;it&#39;s my &quot;node&quot; &amp; i like it 0x12 x09 x0A  0x09 0x0A &lt;node&gt;"

Answer 13

Has anyone tried System.Security.SecurityElement.Escape(yourstring)? It replaces the invalid XML characters in a string with their valid equivalents.

Answer 14

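To summarize, the valid characters in text are:

tab, line feed and carriage return;
all non-control characters, except & and <;
> is not valid when it follows ]].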

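Sections 2.2 and 2.4 of the XML specification provide the answer in detail:

Characters

Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646

Character Data

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.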

Answer 15

For XSL (on really lazy days) I use:

capture="&amp;(?!amp;)" capturereplace="&amp;amp;"

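to translate all & signs that are not followed by amp; into proper ones.

We have cases where the input is in CDATA but the system that consumes the XML does not take that into account. It is a sloppy fix, so beware...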
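The same idea can be sketched in Python with a negative lookahead. Unlike the pattern above, this variant also leaves the other predefined entities and numeric character references alone; that extension, and the name escape_bare_ampersands, are assumptions of this example.

```python
import re

# An "&" that does not already start a predefined entity or a numeric
# character reference.
_BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace stray ampersands with &amp; and leave existing escapes alone."""
    return _BARE_AMP.sub("&amp;", text)
```

It remains a heuristic: text that is genuinely meant to read "&lt;" literally will not be double-escaped, so the same caution applies.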

Invalid characters in XML
