Question

Can a .NET String object contain invalid Unicode code points?

If so, how can this happen (and how can I determine whether a string contains such invalid characters)?

Answer 1

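Yes, it is possible. According to Microsoft's documentation, a .NET String is simply

"A String object is a sequential collection of System.Char objects that represent a string."

while a .NET Char

"Represents a character as a UTF-16 code unit."

Taken together, this means that a .NET String is just a sequence of UTF-16 code units, whether or not they form a valid string according to the Unicode standard. There are many ways this can come about; some of the more common ones I can think of are:

- A non-UTF-16 byte stream is put into a String object without being properly converted.
- A String object is split in the middle of a surrogate pair.
- Someone deliberately includes such a String to test the robustness of a system.

As a result, the following C# code is perfectly legal and will compile: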

class Test
{
    static void Main(){
        string s = 
            "\uEEEE" + // A private use character
            "\uDDDD" + // An unpaired surrogate character
            "\uFFFF" + // A Unicode noncharacter
            "\u0888";  // Unassigned when this was written; later Unicode versions may assign it
        System.Console.WriteLine(s); // Output is highly console dependent
    }
}
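
To illustrate the second cause in the list above, here is a minimal sketch of how cutting a string by char index can leave a lone surrogate behind. The class name SplitDemo and the helper FirstCodeUnit are made up for this example; only the char.* calls are framework APIs.

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: naively keeps only the first UTF-16 code unit.
    public static string FirstCodeUnit(string s)
    {
        return s.Substring(0, 1);
    }

    static void Main()
    {
        // U+1F600 lies outside the BMP, so it is stored as two chars (a surrogate pair).
        string emoji = char.ConvertFromUtf32(0x1F600);
        Console.WriteLine(emoji.Length);                    // 2
        Console.WriteLine(char.IsSurrogatePair(emoji, 0));  // True

        // Cutting by char index keeps the high surrogate and drops its partner.
        string broken = FirstCodeUnit(emoji);
        Console.WriteLine(char.IsHighSurrogate(broken[0])); // True
    }
}
```

Substring, Remove and indexing all work on code units, not code points, so none of them will stop you from doing this.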

Answer 2

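Although the response given by @DPenner is excellent (and I used it as a starting point), I want to supply some other details. Besides orphaned surrogates, which in my opinion are a clear sign of an invalid string, there is always the possibility that a string contains unassigned code points. The .NET Framework cannot treat that case as an error, because new characters are constantly being added to the Unicode standard; see for example the versions of Unicode at http://en.wikipedia.org/wiki/Unicode#Versions. To make this concrete, the call Char.GetUnicodeCategory(Char.ConvertFromUtf32(0x1F01C), 0); returns UnicodeCategory.OtherNotAssigned on .NET 2.0, but UnicodeCategory.OtherSymbol on .NET 4.0.

In addition to this, there is another interesting point: even the .NET class library methods do not agree on how to handle Unicode noncharacters and unpaired surrogate chars. For example:

Unpaired surrogate char
- System.Text.Encoding.Unicode.GetBytes("\uDDDD"); - returns { 0xfd, 0xff }, the encoding of the replacement character, i.e. the data is considered invalid.
- "\uDDDD".Normalize(); - throws an exception with the message "Invalid Unicode code point found at index 0.", i.e. the data is considered invalid.
Noncharacter code points
- System.Text.Encoding.Unicode.GetBytes("\uFFFF"); - returns { 0xff, 0xff }, i.e. the data is considered valid.
- "\uFFFF".Normalize(); - throws an exception with the message "Invalid Unicode code point found at index 0.", i.e. the data is considered invalid.

Below is a method that searches a string for invalid characters: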
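
If you want to see what your own runtime reports for that code point, a small sketch like the following should do. CategoryDemo and CategoryOf are names invented for this example; the result for U+1F01C depends on the Unicode tables your runtime ships with, so no output is shown for it.

```csharp
using System;
using System.Globalization;

static class CategoryDemo
{
    // Hypothetical helper: category of a code point given as an int.
    public static UnicodeCategory CategoryOf(int codePoint)
    {
        string s = char.ConvertFromUtf32(codePoint);
        return char.GetUnicodeCategory(s, 0);
    }

    static void Main()
    {
        // Runtime dependent: older runtimes may report OtherNotAssigned here.
        Console.WriteLine(CategoryOf(0x1F01C));

        // Private use code points are reported as such rather than as unassigned.
        Console.WriteLine(CategoryOf(0xE000)); // PrivateUse
    }
}
```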
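
The comparison above can be reproduced with a short sketch along these lines. InconsistencyDemo, EncodeHex and TryNormalize are names I made up; the Normalize() results are from the .NET Framework and may differ on other runtimes, which is why the code catches the exception and prints whatever happens instead of assuming it.

```csharp
using System;
using System.Text;

static class InconsistencyDemo
{
    // Hypothetical helper: UTF-16LE bytes of s as a hex string such as "FD-FF".
    public static string EncodeHex(string s)
    {
        return BitConverter.ToString(Encoding.Unicode.GetBytes(s));
    }

    // Hypothetical helper: reports whether Normalize() accepted the string.
    public static string TryNormalize(string s)
    {
        try
        {
            s.Normalize();
            return "accepted";
        }
        catch (ArgumentException e)
        {
            return "rejected: " + e.Message;
        }
    }

    static void Main()
    {
        foreach (string s in new[] { "\uDDDD", "\uFFFF" })
        {
            Console.WriteLine(EncodeHex(s));    // FD-FF for the lone surrogate, FF-FF for U+FFFF
            Console.WriteLine(TryNormalize(s)); // runtime dependent
        }
    }
}
```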

/// <summary>
/// Searches a string for invalid characters (noncharacters defined in the Unicode standard and invalid surrogate pairs)
/// </summary>
/// <param name="aString"> the string to search for invalid chars </param>
/// <returns>the index of the first bad char or -1 if no bad char is found</returns>
static int FindInvalidCharIndex(string aString)
{
    int ch;
    int chlow;

    for (int i = 0; i < aString.Length; i++)
    {
        ch = aString[i];
        if (ch < 0xD800) // char is up to first high surrogate
        {
            continue;
        }
        if (ch >= 0xD800 && ch <= 0xDBFF)
        {
            // found high surrogate -> check surrogate pair
            i++;
            if (i == aString.Length)
            {
                // last char is high surrogate, so it is missing its pair
                return i - 1;
            }

            chlow = aString[i];
            if (!(chlow >= 0xDC00 && chlow <= 0xDFFF))
            {
                // did not find a low surrogate after the high surrogate
                return i - 1;
            }

            // convert to UTF32 - like in Char.ConvertToUtf32(highSurrogate, lowSurrogate)
            ch = (ch - 0xD800) * 0x400 + (chlow - 0xDC00) + 0x10000;
            if (ch > 0x10FFFF)
            {
                // invalid Unicode code point - maximum exceeded
                return i;
            }
            if ((ch & 0xFFFE) == 0xFFFE)
            {
                // other non-char found
                return i;
            }
            // found a good surrogate pair
            continue;
        }

        if (ch >= 0xDC00 && ch <= 0xDFFF)
        {
            // unexpected low surrogate
            return i;
        }

        if (ch >= 0xFDD0 && ch <= 0xFDEF)
        {
            // non-chars are considered invalid by System.Text.Encoding.GetBytes() and String.Normalize()
            return i;
        }

        if ((ch & 0xFFFE) == 0xFFFE)
        {
            // other non-char found
            return i;
        }
    }

    return -1;
}
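
If you only care about unpaired surrogates and not about noncharacters, a shorter alternative is to let a strict encoder do the checking. This is just a sketch under that assumption, not a replacement for the method above; StrictCheck and HasUnpairedSurrogate are names I invented, while the three-argument UnicodeEncoding constructor is part of the framework.

```csharp
using System;
using System.Text;

static class StrictCheck
{
    // throwOnInvalidBytes: true makes the encoder throw instead of substituting U+FFFD.
    private static readonly UnicodeEncoding Strict =
        new UnicodeEncoding(bigEndian: false, byteOrderMark: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true if s contains a surrogate without its partner.
    public static bool HasUnpairedSurrogate(string s)
    {
        try
        {
            Strict.GetBytes(s);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    static void Main()
    {
        Console.WriteLine(HasUnpairedSurrogate("abc"));                         // False
        Console.WriteLine(HasUnpairedSurrogate("\uDDDD"));                      // True
        Console.WriteLine(HasUnpairedSurrogate(char.ConvertFromUtf32(0x1F600))); // False
        Console.WriteLine(HasUnpairedSurrogate("\uFFFF"));                      // False: noncharacters pass
    }
}
```

Using an exception for control flow is not cheap, so for hot paths the explicit loop is probably the better choice.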

Answer 3

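All strings in .NET and C# are encoded as UTF-16, with one exception (taken from Jon Skeet's blog):

"... there are two different representations: most of the time UTF-16 is used, but attribute constructor arguments use UTF-8 ..."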

Answer 4

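I think an invalid code point in a .NET String can only arise when someone sets a single element to a hi- or lo-surrogate. It can also happen when someone removes the hi- or lo-surrogate from a valid surrogate pair, which can be done not only by deleting the element but also by changing its value. So in my opinion the answer is "yes", it can happen, and the only cause can be an orphaned hi- or lo-surrogate in the string. Do you have a real example string? Post it here and I can take a look at what is wrong.

B.t.w. the same is true for UTF-16 files; it can happen there too. For a utf-16LE file with a 0xFFFE BOM, make sure your first character is not 0, because then your first 4 bytes are 0xFFFE0000, which will definitely be interpreted as a utf-32LE BOM instead of a utf-16LE BOM!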
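
The BOM collision is easy to check from code. A minimal sketch, where BomDemo and Utf16LeWithLeadingNul are names made up for the example and everything else is standard System.Text:

```csharp
using System;
using System.Linq;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: the start of a UTF-16LE file whose first character is U+0000.
    public static byte[] Utf16LeWithLeadingNul()
    {
        return Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes("\0"))
            .ToArray();
    }

    static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble())); // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetPreamble()));   // FF-FE-00-00
        Console.WriteLine(BitConverter.ToString(Utf16LeWithLeadingNul()));        // FF-FE-00-00
    }
}
```

The last two lines are byte-for-byte identical, so a reader that sniffs the BOM has no way to tell the two cases apart.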

.NET String objects and invalid Unicode code points
