Question

For example, I have this Unicode string, made up of the Cyclone and Japanese Castle characters, defined in C# and .NET, where CLR strings are encoded as UTF-16:

var value = "🌀🏯";

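If you inspect it, you quickly find that value.Length is 4, because C# strings are UTF-16 encoded and each of these characters takes two UTF-16 code units. For that reason I can't just loop over each char and take its UTF-32 decimal value: foreach (var character in value) result = (ulong)character;. This raises the question: how can I get the UTF-32 decimal value of each character in any string?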

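Cyclone should be 127744 and Japanese Castle should be 127983, but I'm looking for a general answer that accepts any C# string and always produces a UTF-32 decimal value for each character in it.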

I even tried looking at Char.ConvertToUtf32, but that seems problematic, for example with:

var value = "a🌀c🏯";

This string has a length of 6. So how do I know where a new character begins? For example:

Char.ConvertToUtf32(value, 0)   97  int
Char.ConvertToUtf32(value, 1)   127744  int
Char.ConvertToUtf32(value, 2)   'Char.ConvertToUtf32(value, 2)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}
Char.ConvertToUtf32(value, 3)   99  int
Char.ConvertToUtf32(value, 4)   127983  int
Char.ConvertToUtf32(value, 5)   'Char.ConvertToUtf32(value, 5)' threw an exception of type 'System.ArgumentException'   int {System.ArgumentException}

There is also this overload:

public static int ConvertToUtf32(
    char highSurrogate,
    char lowSurrogate
)

But to use that one, I first need to work out that I have a surrogate pair. How can I do that?

Answer 1

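Here is an extension method that illustrates one way to do it. The idea is to loop over each char of the string and call char.ConvertToUtf32(string, index) to get the Unicode code point. If the returned value is greater than 0xFFFF, you know the code point is encoded as a surrogate pair, so you advance the index by one more to skip the second (low) surrogate.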

Extension method:

public static IEnumerable<int> GetUnicodeCodePoints(this string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        int unicodeCodePoint = char.ConvertToUtf32(s, i);
        if (unicodeCodePoint > 0xffff)
        {
            i++;
        }
        yield return unicodeCodePoint;
    }
}

Sample usage:

static void Main(string[] args)
{
    string s = "a🌀c🏯";

    foreach(int unicodeCodePoint in s.GetUnicodeCodePoints())
    {
        Console.WriteLine(unicodeCodePoint);
    }
}
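
To see why the "greater than 0xFFFF" check is enough, it may help to do the surrogate arithmetic by hand. The following is only a sketch of the standard UTF-16 decoding formula; the class name SurrogateMath and the method Combine are made up for illustration, and in real code char.ConvertToUtf32(high, low) already does this for you.

```csharp
using System;

public static class SurrogateMath
{
    // Hypothetical helper: combines a high and a low surrogate into one
    // code point using the UTF-16 formula
    //   0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
    public static int Combine(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
        {
            throw new ArgumentException(
                "Expected a high surrogate followed by a low surrogate.");
        }

        // The smallest possible result is 0x10000, so any code point that
        // came from a surrogate pair is always above 0xFFFF.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

For the Cyclone character, whose two UTF-16 code units are 0xD83C and 0xDF00, SurrogateMath.Combine('\uD83C', '\uDF00') works out to 0x1F300, which is the 127744 from the question.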

Answer 2

Solution 1

string value = "🌀🏯";
byte[] rawUtf32AsBytes = Encoding.UTF32.GetBytes(value);
int[] rawUtf32 = new int[rawUtf32AsBytes.Length / 4];
Buffer.BlockCopy(rawUtf32AsBytes, 0, rawUtf32, 0, rawUtf32AsBytes.Length);
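
One thing to keep in mind with Solution 1: Encoding.UTF32 produces little-endian bytes, while Buffer.BlockCopy copies raw memory, so the int values come out right only on a little-endian machine (which is the usual case). If that matters to you, a variant that reads each 4-byte group explicitly might look like the sketch below. It assumes a runtime where System.Buffers.Binary.BinaryPrimitives is available (.NET Core 2.1 or later, or the System.Memory package); the class name Utf32Bytes is just a placeholder.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

public static class Utf32Bytes
{
    // Hypothetical helper: encodes the string as UTF-32 and decodes each
    // 4-byte group as a little-endian int, independent of CPU byte order.
    public static int[] ToCodePoints(string value)
    {
        // Encoding.UTF32 is the little-endian UTF-32 encoding;
        // GetBytes does not prepend a byte order mark.
        byte[] bytes = Encoding.UTF32.GetBytes(value);
        int[] codePoints = new int[bytes.Length / 4];

        for (int i = 0; i < codePoints.Length; i++)
        {
            codePoints[i] =
                BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(i * 4, 4));
        }

        return codePoints;
    }
}
```

Calling Utf32Bytes.ToCodePoints("a\U0001F300c\U0001F3EF") should give the four values 97, 127744, 99 and 127983.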

Solution 2

string value = "🌀🏯";
List<int> rawUtf32list = new List<int>();
for (int i = 0; i < value.Length; i++)
{
    if (Char.IsHighSurrogate(value[i]))
    {
        rawUtf32list.Add(Char.ConvertToUtf32(value[i], value[i + 1]));
        i++;
    }
    else
        rawUtf32list.Add((int)value[i]);
}
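
Note that the loop in Solution 2 reads value[i + 1] without a bounds check, so a string that ends in an unpaired high surrogate will throw. If you are on .NET Core 3.0 or later (an assumption; this API does not exist on .NET Framework), the built-in System.Text.Rune type handles the pairing for you. A minimal sketch, with RuneExample as a placeholder class name:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class RuneExample
{
    // Hypothetical helper: collects the code point of every Unicode scalar
    // value in the string. EnumerateRunes pairs up surrogates itself and
    // yields Rune.ReplacementChar (U+FFFD) for an unpaired surrogate
    // instead of throwing.
    public static List<int> ToCodePoints(string value)
    {
        var result = new List<int>();
        foreach (Rune rune in value.EnumerateRunes())
        {
            result.Add(rune.Value);
        }
        return result;
    }
}
```

With the string from the question, RuneExample.ToCodePoints("a\U0001F300c\U0001F3EF") should contain 97, 127744, 99 and 127983. Whether silently substituting U+FFFD for broken input is acceptable depends on your use case; the explicit loop above fails loudly instead.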

How do I read the characters in a string as UTF-32 decimal values?
