Question

I'm trying to convert a Unicode string to a UTF-8 string:

#include <stdio.h>
#include <string>
#include <atlconv.h>
#include <atlstr.h>

using namespace std;

CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    if (uni.IsEmpty()) return "";
    CStringA utf8;
    int cc = 0;

    if ((cc = WideCharToMultiByte(CP_UTF8, 0, uni, -1, NULL, 0, 0, 0) - 1) > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf) WideCharToMultiByte(CP_UTF8, 0, uni, -1, buf, cc, 0, 0);
        utf8.ReleaseBuffer();
    }
    return utf8;
}

int main(void)
{
    string u8str = ConvertUnicodeToUTF8(L"gökhan");

    printf("%d\n", (int)u8str.size()); // size() returns size_t; cast to match %d

    return 0;
}

My question is: shouldn't u8str.size() return 6? It currently prints 7!

Answer 1

7 is correct. The non-ASCII character ö is encoded as two bytes in UTF-8.
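
To see this concretely, here is a minimal portable sketch that formats each byte of the UTF-8 encoding of "gökhan". The helper name `hex_dump` and the constant `kGokhanUtf8` are made up for illustration; the string is written with explicit escape sequences so the result does not depend on the source file's encoding.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats every byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& s)
{
    std::string out;
    char tmp[4]; // two hex digits + space + terminator
    for (unsigned char byte : s)
    {
        std::snprintf(tmp, sizeof tmp, "%02x ", byte);
        out += tmp;
    }
    if (!out.empty()) out.pop_back(); // drop the trailing space
    return out;
}

// "gökhan" in UTF-8; ö (U+00F6) is the byte pair 0xC3 0xB6.
const std::string kGokhanUtf8 = "g\xC3\xB6khan";
```

Calling `hex_dump(kGokhanUtf8)` should list seven bytes, matching the `size()` of 7 reported in the question.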

Answer 2

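By definition, "multi-byte" means that a single Unicode code point can occupy more than one byte. In UTF-8 that is 1 to 4 bytes (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4). See: How many bytes does one Unicode character take?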

Further reading: http://www.joelonsoftware.com/articles/Unicode.html

Answer 3

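A Unicode code point uses 2 or 4 bytes in UTF-16, but 1 to 4 bytes in UTF-8, depending on its value. A code point that takes 2 bytes in UTF-16 can take up to 3 bytes in UTF-8, so a UTF-8 string may use more bytes than the corresponding UTF-16 string. UTF-8 tends to be more compact for Latin/Western languages, while UTF-16 tends to be more compact for East Asian languages.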
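
The size ranges can be written down as a small sketch. The function names `utf8_bytes` and `utf16_bytes` are illustrative only, and both assume a valid code point (at most U+10FFFF, not a surrogate).

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid Unicode code point (<= 0x10FFFF).
constexpr std::size_t utf8_bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed in UTF-16: one 2-byte code unit inside the Basic
// Multilingual Plane, otherwise a surrogate pair (two code units).
constexpr std::size_t utf16_bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}
```

For example, ö (U+00F6) takes 2 bytes in either encoding, while 中 (U+4E2D) takes 3 bytes in UTF-8 but only 2 in UTF-16.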

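std::(w)string::size() and CStringT::GetLength() count the number of encoded code units, not the number of code points. In your example, "gökhan" is encoded as the following bytes:

UTF-16LE: 67 00 f6 00 6b 00 68 00 61 00 6e 00
UTF-16BE: 00 67 00 f6 00 6b 00 68 00 61 00 6e
UTF-8: 67 c3 b6 6b 68 61 6e

Note that ö is encoded as 1 code unit in UTF-16 (0x00f6, stored as bytes f6 00 in LE and 00 f6 in BE) but as 2 code units in UTF-8 (0xc3 0xb6). That is why your UTF-8 string has a size of 7 instead of 6.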
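
If you want the number of code points rather than code units, one common approach is to count every byte that is not a UTF-8 continuation byte. The sketch below assumes the input is already valid UTF-8; `count_code_points` is a made-up name, not a standard library function.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumes utf8 holds valid UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80) ++n; // lead byte or ASCII byte
    }
    return n;
}
```

With the 7-byte string "g\xC3\xB6khan", `size()` is 7 while `count_code_points` yields 6.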

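That being said, when you call WideCharToMultiByte() or MultiByteToWideChar() with -1 as the source length, the function has to count the characters itself, and the returned size includes room for a null terminator when the destination pointer is NULL. You don't need that extra space when using CStringA/W, std::(w)string, etc., and you don't need the overhead of counting characters when the source already knows its length. You should specify the actual source length whenever you know it, for example: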

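CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    CStringA utf8;

    // First call: pass the real source length (no terminator) to get the required byte count.
    int cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), NULL, 0, 0, 0);
    if (cc > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf)
        {
            // Second call: do the conversion; returns the bytes written, or 0 on failure.
            cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), buf, cc, 0, 0);
            // Set the length explicitly, since no null terminator was written.
            utf8.ReleaseBuffer(cc);
        }
    }

    return utf8;
}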
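
For readers without the Windows API at hand, the same idea (convert exactly the code units the source string reports, with no terminator involved) can be sketched in portable C++. This is a hand-written stand-in, not how WideCharToMultiByte() is implemented; the name `utf16_to_utf8` is hypothetical, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical portable stand-in for a UTF-16 to UTF-8 conversion.
// Uses src.size() as the explicit source length, so no terminator is
// scanned for or written. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& src)
{
    std::string out;
    out.reserve(src.size() * 3); // worst case: 3 bytes per UTF-16 code unit

    for (std::size_t i = 0; i < src.size(); ++i)
    {
        char32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src.size()
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF)
        {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Passing `u"g\u00F6khan"` (6 code units) yields a 7-byte std::string, the same count the question observed.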
