Question

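My C# program receives some UTF-8 encoded data and decodes it using Encoding.UTF8.GetString(data). When the program that produces the data encounters characters outside the BMP, it encodes them as 2 surrogate characters, each encoded separately as UTF-8. In such cases, my program can't decode them properly.

How can I decode such data in C#?

Example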

static void Main(string[] args)
{
    string orig = "\U0001F30E"; // U+1F30E, a character outside the BMP
    byte[] correctUTF8 = Encoding.UTF8.GetBytes(orig); // Simulate correct conversion using std::codecvt_utf8_utf16<wchar_t>
    Console.WriteLine("correctUTF8: " + BitConverter.ToString(correctUTF8));  // F0-9F-8C-8E - that's what the C++ program should've produced

    // Simulate bad conversion using std::codecvt_utf8<wchar_t> - that's what I get from the program
    byte[] badUTF8 = new byte[] { 0xED, 0xA0, 0xBC, 0xED, 0xBC, 0x8E };
    string badString = Encoding.UTF8.GetString(badUTF8); // ���� (4 * U+FFFD 'REPLACEMENT CHARACTER')
    // How can I convert this?
}

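Note: the encoding program is written in C++ and converts the data using std::codecvt_utf8<wchar_t> (code below). As @PeterDuniho's answer correctly points out, it should have used std::codecvt_utf8_utf16<wchar_t>. Unfortunately, I don't control this program and can't change its behavior; I can only handle the malformed input it gives me.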

std::wstring_convert<std::codecvt_utf8<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

Answer 1

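Without a good Minimal, Complete, and Verifiable code example, it's impossible to know for sure. But it looks as though you are using the wrong converter in C++.

The std::codecvt_utf8<wchar_t> facet converts from UCS-2, not UTF-16 (when wchar_t is 16 bits wide, as on Windows). The two are very similar, but UCS-2 doesn't support the surrogate pairs that are required to encode the character you want to encode.

Instead, you should be using std::codecvt_utf8_utf16<wchar_t>: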
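To make the UCS-2 versus UTF-16 difference concrete, here is a minimal hand-written sketch that avoids the standard codecvt facets entirely. The function names (appendUtf8, encodeAsUcs2, encodeAsUtf16) are hypothetical and only illustrate the two interpretations of the same 16-bit units; this is not the library's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 style encoding of one value. It deliberately does NOT
// reject surrogates (0xD800-0xDFFF), mimicking a UCS-2 style encoder.
static void appendUtf8(std::string& out, std::uint32_t v) {
    if (v < 0x80) {
        out += static_cast<char>(v);
    } else if (v < 0x800) {
        out += static_cast<char>(0xC0 | (v >> 6));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else if (v < 0x10000) {
        out += static_cast<char>(0xE0 | (v >> 12));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (v >> 18));
        out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (v & 0x3F));
    }
}

// UCS-2 view: every 16-bit unit is treated as a character of its own,
// so each half of a surrogate pair becomes a separate 3-byte sequence.
std::string encodeAsUcs2(const std::u16string& s) {
    std::string out;
    for (char16_t u : s) appendUtf8(out, u);
    return out;
}

// UTF-16 view: a high surrogate followed by a low surrogate is combined
// into one code point before encoding, giving a single 4-byte sequence.
std::string encodeAsUtf16(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t v = s[i];
        if (v >= 0xD800 && v <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            v = 0x10000 + ((v - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        appendUtf8(out, v);
    }
    return out;
}
```

For the input u"\U0001F30E" (the surrogate pair D83C DF0E), encodeAsUcs2 yields the bytes ED A0 BC ED BC 8E from the question, while encodeAsUtf16 yields F0 9F 8C 8E.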

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf8Converter;
std::string utf8str = utf8Converter.to_bytes(wstr);

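When I use that converter, I get the UTF-8 bytes needed: F0 9F 8C 8E. These, of course, decode correctly in .NET when interpreted as UTF-8.

Addendum:

The question has been updated to indicate that the encoding code can't be changed. You are stuck with UCS-2 that has been encoded into invalid UTF8. Because the UTF8 is invalid, you'll have to decode the text yourself.

I see a couple of reasonable ways to do this. First, write a decoder that doesn't care whether the UTF8 includes invalid byte sequences. Second, use the C++ std::wstring_convert<std::codecvt_utf8<wchar_t>> converter to decode the bytes for you (e.g. write your receiving code in C++, or write a C++ DLL that you can call from your C# code to do the work).

The second option is in some sense the more reliable, because you'd be using exactly the converter that created the bad data in the first place. On the other hand, even creating a DLL might be overkill, never mind writing the entire client in C++. When making a DLL, even with C++/CLI, you'll still have some headaches getting interop to work right, unless you're already an expert.

I'm familiar with C++/CLI, but hardly an expert. I'm much better with C#, so here's some code for the first option: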
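The reason a strict decoder rejects this data can be sketched briefly. Well-formed UTF-8 never encodes a value in the surrogate range U+D800 to U+DFFF, and such a value always shows up as lead byte 0xED followed by a second byte in the range 0xA0 to 0xBF. The helper below (containsEncodedSurrogate is a hypothetical name, written in C++ purely for illustration) detects that pattern:

```cpp
#include <cstddef>
#include <string>

// Returns true if the byte string contains a 3-byte sequence that encodes a
// surrogate (U+D800-U+DFFF): lead byte 0xED, second byte 0xA0-0xBF, and a
// third byte that is a continuation byte. In well-formed UTF-8 the byte 0xED
// only appears as a lead byte and is followed by 0x80-0x9F, so a plain scan
// is enough to spot the problem.
bool containsEncodedSurrogate(const std::string& bytes) {
    for (std::size_t i = 0; i + 2 < bytes.size(); ++i) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
        unsigned char b2 = static_cast<unsigned char>(bytes[i + 2]);
        if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF && (b2 & 0xC0) == 0x80) {
            return true;
        }
    }
    return false;
}
```

A check like this could be used to decide whether a buffer needs the lenient decoding path at all, or whether the standard decoder can handle it.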

private const int _khighOffset = 0xD800 - (0x10000 >> 10);

/// <summary>
/// Decodes a nominally UTF8 byte sequence as UTF16. Ignores all data errors
/// except those which prevent coherent interpretation of the input data.
/// Input with invalid-but-decodable UTF8 sequences will be decoded without
/// error, and may lead to invalid UTF16.
/// </summary>
/// <param name="bytes">The UTF8 byte sequence to decode</param>
/// <returns>A string value representing the decoded UTF8</returns>
/// <remarks>
/// This method has not been thoroughly validated. It should be tested
/// carefully with a broad range of inputs (the entire UTF16 code point
/// range would not be unreasonable) before being used in any sort of
/// production environment.
/// </remarks>
private static string DecodeUtf8WithOverlong(byte[] bytes)
{
    // Requires: using System; using System.Collections.Generic;
    List<char> result = new List<char>();
    int continuationCount = 0, continuationAccumulator = 0, highBase = 0;
    char continuationBase = '\0';

    for (int i = 0; i < bytes.Length; i++)
    {
        byte b = bytes[i];

        if (b < 0x80)
        {
            // Single-byte (ASCII) value
            result.Add((char)b);
            continue;
        }

        if (b < 0xC0)
        {
            // Byte values in this range are used only as continuation bytes.
            // If we aren't expecting any continuation bytes, then the input
            // is invalid beyond repair.
            if (continuationCount == 0)
            {
                throw new ArgumentException("invalid encoding");
            }

            // Each continuation byte represents 6 bits of the actual
            // character value
            continuationAccumulator <<= 6;
            continuationAccumulator |= (b - 0x80);

            if (--continuationCount == 0)
            {
                continuationAccumulator += highBase;

                if (continuationAccumulator > 0xffff)
                {
                    // Code point requires more than 16 bits, so split into surrogate pair
                    char highSurrogate = (char)(_khighOffset + (continuationAccumulator >> 10)),
                        lowSurrogate = (char)(0xDC00 + (continuationAccumulator & 0x3FF));

                    result.Add(highSurrogate);
                    result.Add(lowSurrogate);
                }
                else
                {
                    result.Add((char)(continuationBase | continuationAccumulator));
                }

                continuationAccumulator = 0;
                continuationBase = '\0';
                highBase = 0;
            }

            continue;
        }

        if (b < 0xE0)
        {
            // Lead byte of a 2-byte sequence: 5 payload bits above 6 continuation bits
            continuationCount = 1;
            continuationBase = (char)((b - 0xC0) * 0x0040);
            continue;
        }

        if (b < 0xF0)
        {
            // Lead byte of a 3-byte sequence: 4 payload bits above 12 continuation bits
            continuationCount = 2;
            continuationBase = (char)((b - 0xE0) * 0x1000);
            continue;
        }

        if (b < 0xF8)
        {
            continuationCount = 3;
            highBase = (b - 0xF0) * 0x00040000;
            continue;
        }

        if (b < 0xFC)
        {
            continuationCount = 4;
            highBase = (b - 0xF8) * 0x01000000;
            continue;
        }

        if (b < 0xFE)
        {
            continuationCount = 5;
            highBase = (b - 0xFC) * 0x40000000;
            continue;
        }

        // byte values of 0xFE and 0xFF are invalid
        throw new ArgumentException("invalid encoding");
    }

    return new string(result.ToArray());
}

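I tested it with your globe character, and it works fine for that. It also correctly decodes the proper UTF8 for that character (i.e. F0 9F 8C 8E). You will of course want to test it with a full range of data if you intend to use that code for decoding all of your UTF8 input.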
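The advice to test with a full range of data can be made concrete with an exhaustive round trip over every code point outside the BMP. The sketch below is written in C++ (the language of the producing program) rather than C#, and decodeLenient is a hypothetical stand-in for a lenient decoder, not a translation of the C# method above. An equivalent loop in C# would call the C# method instead. It assumes the producer encodes each surrogate as its own 3-byte sequence, as in the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one 16-bit unit (>= 0x800) as a 3-byte UTF-8 style sequence,
// the way the faulty producer encodes each surrogate on its own.
static void appendThreeByte(std::string& out, char16_t u) {
    out += static_cast<char>(0xE0 | (u >> 12));
    out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (u & 0x3F));
}

// Lenient decode to UTF-16 code units: 1- to 3-byte sequences yield one unit
// each (surrogates are passed through), 4-byte sequences yield a surrogate
// pair. Throws only when the byte structure itself is broken.
std::u16string decodeLenient(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t value;
        std::size_t extra;
        if (b < 0x80)      { value = b;        extra = 0; }
        else if (b < 0xC0) { throw std::invalid_argument("stray continuation byte"); }
        else if (b < 0xE0) { value = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { value = b & 0x0F; extra = 2; }
        else if (b < 0xF8) { value = b & 0x07; extra = 3; }
        else               { throw std::invalid_argument("unsupported lead byte"); }
        if (i + extra >= in.size()) throw std::invalid_argument("truncated sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("missing continuation byte");
            value = (value << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (value > 0x10FFFF) throw std::invalid_argument("value out of range");
        if (value >= 0x10000) {
            value -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (value >> 10));
            out += static_cast<char16_t>(0xDC00 + (value & 0x3FF));
        } else {
            out += static_cast<char16_t>(value);
        }
    }
    return out;
}

// Checks that every supplementary code point, encoded the "bad" way as two
// separately encoded surrogates, decodes back to the same surrogate pair.
bool roundTripsAllSupplementary() {
    for (std::uint32_t cp = 0x10000; cp <= 0x10FFFF; ++cp) {
        char16_t hi = static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
        char16_t lo = static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
        std::string bad;
        appendThreeByte(bad, hi);
        appendThreeByte(bad, lo);
        std::u16string decoded = decodeLenient(bad);
        if (decoded.size() != 2 || decoded[0] != hi || decoded[1] != lo) return false;
    }
    return true;
}
```

The loop covers just over a million code points, so it is cheap enough to run as part of an ordinary test suite. A fuller test would also cover the BMP and deliberately broken input.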
