Question

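I am trying to work out the right way to handle Unicode in C++. I want to understand how g++ handles wide string literals, as well as regular C strings that contain Unicode characters. I have set up some basic tests, but I don't really understand what is happening.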

wstring ws1(L"«¬.txt"); // these first 2 characters correspond to 0xAB, 0xAC
string s1("«¬.txt");

ifstream in_file( s1.c_str() );
// wifstream in_file( s1.c_str() ); // this throws an exception when I 
                                    // call in_file >> s;
string s;
in_file >> s; // s now contains «¬

wstring ws = textToWide(s);

wcout << ws << endl; // these two lines work independently of each other,
                     // but combining them makes the second one print incorrectly
cout << s << endl;
printf( "%s", s.c_str() ); // same case here, these work independently of one another,
                           // but calling one after the other makes the second call
                           // print incorrectly
wprintf( L"%s", ws.c_str() );

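wstring textToWide(const string &s)
{
    // Converts using the multibyte encoding of the current C locale (see setlocale).
    mbstate_t mbstate = mbstate_t();    // start from the initial conversion state
    const char *src = s.c_str();        // mbsrtowcs updates this pointer, so pass a copy
    size_t numchars = mbsrtowcs(0, &src, 0, &mbstate);  // count wide characters only
    if (numchars == (size_t)-1)
        return wstring();               // invalid multibyte sequence in the current locale
    vector<wchar_t> buff(numchars + 1); // needs <vector>; freed automatically
    src = s.c_str();
    mbstate = mbstate_t();
    mbsrtowcs(&buff[0], &src, numchars + 1, &mbstate);
    return wstring(&buff[0], numchars);
}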

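It seems that calling wcout and wprintf corrupts the stream in some way, and that cout and printf are always safe to call as long as the strings are encoded as UTF-8.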

Is the best way to handle Unicode to convert all input to wide strings before processing, and to convert all output to UTF-8 before sending it to output?

Answer 1

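The most comprehensive way to handle Unicode is to use a Unicode library such as ICU. Unicode has many more aspects than a handful of encodings, and C++ offers no API for any of those extra aspects. ICU does.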

If you only want to deal with encodings, a somewhat workable approach is to use the built-in C++ facilities correctly. This includes calling

std::setlocale(LC_ALL, 
               /*some system-specific locale name, probably */ "en_US.UTF-8")

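at the start of the program. Also, do not use cout/printf and wcout/wprintf in the same program. (Apart from the standard handles, you can use regular and wide stream objects in the same program.)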
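The reason mixing fails is stream orientation: in the C I/O model, the first operation on a stream fixes it as byte-oriented or wide-oriented, and operations of the other kind are not supposed to be used on it afterwards. By default cout and wcout are synchronized with stdout, so they are subject to the same rule. A minimal sketch that makes this visible on a temporary file; the helper name `orientation_after_first_write` is made up for illustration:

```cpp
#include <cstdio>
#include <cwchar>

// Writes once to a fresh temporary stream, narrow or wide, and returns the
// stream's resulting orientation: < 0 byte-oriented, > 0 wide-oriented.
// Returns 0 if the temporary file could not be created.
int orientation_after_first_write(bool wide)
{
    std::FILE *f = std::tmpfile();
    if (!f)
        return 0;
    if (wide)
        std::fputws(L"abc", f);
    else
        std::fputs("abc", f);
    int result = std::fwide(f, 0); // mode 0 only queries, it never changes the orientation
    std::fclose(f);
    return result;
}
```

Calling `orientation_after_first_write(false)` should give a negative value and `orientation_after_first_write(true)` a positive one. The same thing happens to stdout after the first printf or wprintf, which is why the second family of calls misbehaves in the question's test.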

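Converting all input to wide and all output to UTF-8 is a reasonable strategy. Working in UTF-8 throughout is also reasonable. A lot depends on your application. C++11 has built-in UTF-8, UTF-16, and UTF-32 string types that can simplify the task somewhat.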
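As a rough illustration of the C++11 literal and string types, the sketch below counts code units for the same character in each encoding form. The function names are invented for this example, and it assumes a compiler in C++11 mode or later. `sizeof` is used for the UTF-8 case because the element type of `u8` literals changed in C++20.

```cpp
#include <cstddef>
#include <string>

// U+00AB is the « from the question; U+1F600 lies outside the Basic
// Multilingual Plane, so UTF-16 needs a surrogate pair for it.

// UTF-8: count bytes, minus the terminating zero.
std::size_t utf8_units_ab()     { return sizeof(u8"\u00AB") - 1; }
std::size_t utf8_units_emoji()  { return sizeof(u8"\U0001F600") - 1; }

// UTF-16: std::u16string holds char16_t code units.
std::size_t utf16_units_ab()    { return std::u16string(u"\u00AB").size(); }
std::size_t utf16_units_emoji() { return std::u16string(u"\U0001F600").size(); }

// UTF-32: std::u32string holds one char32_t per code point.
std::size_t utf32_units_ab()    { return std::u32string(U"\u00AB").size(); }
std::size_t utf32_units_emoji() { return std::u32string(U"\U0001F600").size(); }
```

Only the UTF-32 form has one element per code point, which is the main argument for converting to a fixed-width form before processing. UTF-8 and UTF-16 are more compact but need decoding whenever individual characters matter.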

Whatever you do, do not use elements of the extended character set in string literals. (In C++11 you can use them in UTF-8/16/32 string literals.)
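One way to follow this advice, sketched here as an assumption rather than something the answer spells out, is to write the question's file name with escapes so that the bytes no longer depend on how the source file was saved or how the compiler decodes it. The function names are hypothetical.

```cpp
#include <string>

// "«¬.txt" as UTF-8 bytes, spelled with hex escapes:
// U+00AB encodes as C2 AB and U+00AC as C2 AC.
std::string narrow_name()
{
    return "\xC2\xAB\xC2\xAC.txt";
}

// The same name as a wide string, spelled with universal-character-names.
std::wstring wide_name()
{
    return L"\u00AB\u00AC.txt";
}
```

The narrow version is 8 bytes long and the wide version is 6 wide characters long. Hex escapes consume every following hexadecimal digit, so this spelling only works because each escape here is followed by a backslash or a dot.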

How do C++ and g++ handle Unicode?
