Question

I'm having trouble converting a UTF-8 encoded string to a UTF-16 encoded CStringW.

Here is my source code:

CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 )
{
    _wsetlocale( LC_ALL, L"Korean" );
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') )
    {
        return L"";
    }
    const size_t cchUTF8Max = INT_MAX - 1;
    size_t cchUTF8;
    HRESULT hr = ::StringCbLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );
    if ( FAILED( hr ) )
    {
        AtlThrow( hr );
    }
    ++cchUTF8;
    int cbUTF8 = static_cast<int>( cchUTF8 );

    int cchUTF16 = ::MultiByteToWideChar(
        CP_UTF8,
        MB_ERR_INVALID_CHARS,
        pszTextUTF8,
        -1,
        NULL,
        0
        );

    CString strUTF16;
    strUTF16.GetBufferSetLength(cbUTF8);
    WCHAR * pszUTF16 = new WCHAR[cchUTF16];

    int result = ::MultiByteToWideChar(
        CP_UTF8,
        0,
        pszTextUTF8,
        cbUTF8,
        pszUTF16,
        cchUTF16
        );
    ATLASSERT( result != 0 );
    if ( result == 0 )
    {
        AtlThrowLastWin32();
    }
    strUTF16.Format(_T("%s"), pszUTF16);
    return strUTF16;
}

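pszTextUTF8 is the content of an htm file, encoded in UTF-8. When the htm file is smaller than 500 KB, this code works fine. But when converting an htm file larger than 500 KB (the one I have is 648 KB), pszUTF16 contains the entire content of the file, while strUTF16 does not: its m_pszData holds only about half of it. I don't think the file is being opened incorrectly.

All of the content is there in strUTF16.GetBuffer(); what should I do? XmlDocument.Load does not work.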

Answer 1

The code in the question is riddled with bugs, roughly one bug for every 1-2 lines of code.

Here is a short summary:

_wsetlocale( LC_ALL, L"Korean" );

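Changing a global setting inside a conversion function is unexpected, and it will break the code that calls it. It isn't even necessary; you aren't using the locale for the encoding conversion.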
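As a rough, portable illustration of why this matters (it is not the asker's code): the sketch below uses `std::locale::global` as a stand-in for `_wsetlocale`, and the names `CommaDecimal`, `helper_with_side_effect`, and `format_number` are hypothetical. It shows a helper that changes the global locale and thereby silently alters how unrelated code formats numbers afterwards.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: uses ',' as the decimal point.
struct CommaDecimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Hypothetical helper that changes process-wide state as a side effect,
// much like calling _wsetlocale inside a conversion function.
void helper_with_side_effect() {
    std::locale::global(std::locale(std::locale::classic(), new CommaDecimal));
}

// Unrelated caller code: a stream picks up the global locale when constructed.
std::string format_number(double value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

Calling `format_number(1.5)` before and after `helper_with_side_effect()` gives different text, even though the caller never asked for a locale change.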

HRESULT hr = ::StringCbLengthA( pszTextUTF8, cchUTF8Max, &cchUTF8 );

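This passes the wrong value for cchUTF8Max (according to the documentation), and it counts bytes (as opposed to characters, i.e. code units). Besides that, you don't even need to know the number of code units, since you never use it (well, you do, but that's just another bug).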

int cbUTF8 = static_cast<int>( cchUTF8 );

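While this fixes the prefix (cb for a count of bytes, as opposed to cch for a count of characters), it doesn't stop you from later using the variable for something that needs an unrelated value.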

strUTF16.GetBufferSetLength(cbUTF8);

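This resizes the string object that should eventually hold the UTF-16 encoded characters. But it doesn't use the correct number of characters (the preceding call to MultiByteToWideChar provided that value); instead it picks a completely unrelated value: the number of bytes in the UTF-8 encoded source string.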

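It doesn't stop there, though: that line of code also throws away the pointer to the internal buffer, which was ready to be written to. Failing to call ReleaseBuffer is only a natural consequence, since you decided against reading the documentation.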
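To make the mismatch between the UTF-8 byte count and the UTF-16 length concrete, here is a minimal portable sketch. The function name `utf16_units_for_utf8` is hypothetical, and the code assumes well-formed UTF-8 input; on Windows, the sizing call to MultiByteToWideChar is what reports this number.

```cpp
#include <cstddef>
#include <cstring>

// Counts the UTF-16 code units needed to represent a null-terminated UTF-8 string
// (terminator not included). Assumption: the input is well-formed UTF-8; no
// validation is performed.
std::size_t utf16_units_for_utf8(const char* utf8) {
    std::size_t units = 0;
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(utf8); *p != 0; ++p) {
        if ((*p & 0xC0) == 0x80) {
            continue;                       // continuation byte: belongs to the previous lead byte
        }
        units += (*p >= 0xF0) ? 2 : 1;      // 4-byte sequences need a surrogate pair
    }
    return units;
}
```

For ASCII-only text the two counts coincide, which is why such a mix-up can go unnoticed until the input contains non-ASCII characters such as Korean text, where each syllable takes three bytes in UTF-8 but a single UTF-16 code unit.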

WCHAR * pszUTF16 = new WCHAR[cchUTF16];

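While not a bug in itself, this needlessly allocates another buffer (this time passing the correct size). You already allocated a buffer in the preceding call to GetBufferSetLength (albeit with the wrong size). Just use that one; that's what the member function is for.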

strUTF16.Format(_T("%s"), pszUTF16);

This is probably the anti-pattern associated with the printf family of functions. It is a convoluted way of writing CopyChars (or Append).

With that out of the way, here is the correct way to write that function (or at least one way to do it):

CStringW ConvertUTF8ToUTF16( __in const CHAR * pszTextUTF8 ) {
    // Allocate return value immediately, so that (N)RVO can be applied
    CStringW strUTF16;
    if ( (pszTextUTF8 == NULL) || (*pszTextUTF8 == '\0') ) {
        return strUTF16;
    }

    // Calculate the required destination buffer size
    int cchUTF16 = ::MultiByteToWideChar( CP_UTF8,
                                          MB_ERR_INVALID_CHARS,
                                          pszTextUTF8,
                                          -1,
                                          nullptr,
                                          0 );

    // Perform error checking
    if ( cchUTF16 == 0 ) {
        throw std::runtime_error( "MultiByteToWideChar failed." );
    }

    // Resize the output string and get a pointer to its internal buffer.
    // cchUTF16 includes the terminating null, because -1 was passed as the input length.
    wchar_t* const pszUTF16 = strUTF16.GetBufferSetLength( cchUTF16 );

    // Perform conversion (return value ignored, since we just checked for success)
    ::MultiByteToWideChar( CP_UTF8,
                           MB_ERR_INVALID_CHARS, // Use identical flags
                           pszTextUTF8,
                           -1,
                           pszUTF16,
                           cchUTF16 );

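    // Perform required cleanup; with no argument, ReleaseBuffer recomputes the
    // length from the null terminator, so the terminator is not counted
    strUTF16.ReleaseBuffer();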

    // Return converted string
    return strUTF16;
}
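The same two-pass pattern (measure first, then convert straight into the string's own buffer) can be sketched portably, outside of ATL and the Windows API. Everything here is an assumption made for illustration: `decode_utf8` is a hypothetical stand-in for MultiByteToWideChar, `std::u16string` stands in for CStringW, and the validation is simplified (it rejects bad lead and continuation bytes but does not check for overlong encodings or encoded surrogates).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for MultiByteToWideChar: decodes null-terminated UTF-8 and
// returns the number of UTF-16 code units (terminator not included).
// Output is written only when dst is non-null, so passing nullptr just measures.
std::size_t decode_utf8(const char* src, char16_t* dst) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(src);
    std::size_t n = 0;
    while (*p != 0) {
        char32_t cp = 0;
        int extra = 0;
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        ++p;
        for (int i = 0; i < extra; ++i, ++p) {
            // A terminator in the middle of a sequence also fails this check.
            if ((*p & 0xC0) != 0x80) throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp > 0x10FFFF) throw std::runtime_error("code point out of range");
        if (cp >= 0x10000) {
            // Outside the BMP: encode as a surrogate pair.
            cp -= 0x10000;
            if (dst != nullptr) {
                dst[n]     = static_cast<char16_t>(0xD800 + (cp >> 10));
                dst[n + 1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            }
            n += 2;
        } else {
            if (dst != nullptr) dst[n] = static_cast<char16_t>(cp);
            n += 1;
        }
    }
    return n;
}

std::u16string utf8_to_utf16(const char* utf8) {
    std::u16string out;
    if (utf8 == nullptr || *utf8 == '\0') {
        return out;
    }
    out.resize(decode_utf8(utf8, nullptr));  // pass 1: measure
    decode_utf8(utf8, &out[0]);              // pass 2: write into the string's own buffer
    return out;
}
```

No second buffer and no printf-style copy is involved, so the result has exactly the measured length regardless of input size. On Windows, the ATL version above remains the one to use; this sketch only mirrors its structure.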

How to convert a large UTF-8 encoded char* string to CStringW (UTF-16)?
