Question

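There seems to be a problem when I write words containing foreign characters (French, for example).

For example, if I read input into a std::string or a char[], like this: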

std::string s;
std::cin>>s;  //if we input the string "café"
std::cout<<s<<std::endl;  //outputs "café"

everything works fine.

But if the string is hard-coded:

std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"

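What is going on? Which characters does C++ support, and how do I make this work correctly? Is it related to my operating system (Windows 10)? My IDE (VS 15)? Or to C++ itself?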

Answer 1

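In short, if you want to pass Unicode text to the console, or receive it from the console, on Windows 10 (in fact, on any version of Windows), you need to use wide strings, i.e. std::wstring. Windows itself does not support UTF-8 encoding. This is a fundamental limitation of the operating system.

The entire Win32 API, on which things like console and file system access are built, works with Unicode characters only in UTF-16 encoding, and the C/C++ runtime that ships with Visual Studio provides no translation layer to make this API compatible with UTF-8. That does not mean you cannot use UTF-8 internally; it just means that whenever you hit the Win32 API, or a C/C++ runtime feature that uses it, you need to convert between UTF-8 and UTF-16. It sucks, but that is where we are right now.

Some people may point you to a collection of tricks that claim to make the console work with UTF-8. Don't go down this road; you will run into a lot of problems. Unicode console access is only properly supported with wide strings.

Edit: Since UTF-8/UTF-16 string conversion is non-trivial, and C++ does not provide much help with it either, here are some conversion functions I prepared earlier: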

///////////////////////////////////////////////////////////////////////////////////////////////////
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}

///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) | ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}

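I was given the unenviable task of converting a 6-million-line legacy Windows application to support Unicode. It had been written to support ASCII only (in fact, its development predates Unicode), and we used std::string and char[] to store strings internally. Since changing all of the internal string storage buffers was simply not feasible, we needed to adopt UTF-8 internally and convert between UTF-8 and UTF-16 whenever we hit the Win32 API. These are the conversion functions we used.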
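The surrogate-pair arithmetic in the 4-byte branch of UTF8ToUTF16, and its inverse in UTF16ToUTF8, can be checked in isolation. The following is a minimal sketch of just that step; the helper names toSurrogatePair and fromSurrogatePair are hypothetical and are not part of the functions above, and the input range is assumed rather than validated.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point above U+FFFF into a UTF-16
// surrogate pair (high, low).
// Assumption: 0x10000 <= codePoint <= 0x10FFFF. This is not checked here.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t codePoint)
{
    // Subtracting 0x10000 leaves a 20-bit value
    const std::uint32_t offset = codePoint - 0x10000;

    // The top 10 bits go into the high surrogate, the bottom 10 into the low one
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 | (offset >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 | (offset & 0x03FF));
    return {high, low};
}

// Hypothetical helper: recombine a surrogate pair into the original code point.
std::uint32_t fromSurrogatePair(std::uint16_t high, std::uint16_t low)
{
    return 0x10000 + (((static_cast<std::uint32_t>(high) & 0x03FF) << 10) |
                      (static_cast<std::uint32_t>(low) & 0x03FF));
}
```

For example, toSurrogatePair(0x1F600) yields the pair 0xD83D, 0xDE00, and fromSurrogatePair(0xD83D, 0xDE00) gives back 0x1F600.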

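I strongly recommend sticking with what is supported for new Windows development, which means wide strings. That said, there is no reason you can't base the core of your program on UTF-8 strings, but it will make things trickier when interacting with Windows and with various aspects of the C/C++ runtime.

Edit 2: I have just re-read the original question, and I can see that I did not answer it very well. Let me give some more information that specifically answers your question.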

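What is going on? When developing with C++ on Windows, if you use std::string with std::cin/std::cout, console I/O is done using MBCS encoding. This is a deprecated mode in which characters are encoded using the code page currently selected on the machine. Values encoded under these code pages are not Unicode, and cannot be shared with another system that has a different code page selected, or even with the same system if you change its code page. It works perfectly in your test because you capture the input under the current code page and display it back under the same code page. If you capture that input and save it to a file, inspection will show that it is not Unicode. Load it back with a different code page selected in the OS, and the text will appear corrupted. You can only interpret the text if you know which code page it was encoded in. Since these legacy code pages are regional, and none of them can represent all text characters, it is effectively impossible to share text universally across different machines. MBCS predates the development of Unicode, and Unicode was invented precisely because of this kind of problem. Unicode is essentially the "one code page to rule them all". You may wonder why UTF-8 is not a selectable "legacy" code page on Windows. Many of us wonder the same thing. Suffice it to say that it isn't. So you should not rely on MBCS encoding, because you cannot get Unicode support while using it. Your only option for Unicode support on Windows is to use std::wstring and call the UTF-16 Win32 APIs.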
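One way to "inspect" captured bytes, as described above, is to test whether they even have the shape of UTF-8. The sketch below is a simplified structural check only: the name looksLikeUtf8 is hypothetical, and the function deliberately does not reject every ill-formed sequence (overlong 3-byte forms and encoded surrogates still pass), so treat it as an illustration rather than a validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte sequence in `bytes` has the
// lead-byte / continuation-byte structure of UTF-8.
// Simplification: overlong 3- and 4-byte forms and surrogates are not rejected.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t pos = 0;
    while (pos < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[pos]);

        // Work out how many continuation bytes must follow this lead byte
        std::size_t continuationCount = 0;
        if (lead <= 0x7F)
            continuationCount = 0;   // plain ASCII
        else if (lead >= 0xC2 && lead <= 0xDF)
            continuationCount = 1;
        else if (lead >= 0xE0 && lead <= 0xEF)
            continuationCount = 2;
        else if (lead >= 0xF0 && lead <= 0xF4)
            continuationCount = 3;
        else
            return false;            // stray continuation byte or invalid lead byte

        // The sequence must not run past the end of the input
        if (bytes.size() - pos - 1 < continuationCount)
            return false;

        // Every continuation byte must have the bit pattern 10xxxxxx
        for (std::size_t i = 1; i <= continuationCount; ++i)
        {
            const unsigned char next = static_cast<unsigned char>(bytes[pos + i]);
            if ((next & 0xC0) != 0x80)
                return false;
        }
        pos += continuationCount + 1;
    }
    return true;
}
```

With this check, the UTF-8 bytes "caf\xc3\xa9" pass, while the single-byte code page forms "caf\x82" and "caf\xe9" do not, which is the kind of difference you would see when inspecting a file saved from MBCS console input.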

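As for your hard-coded string example, the first thing to understand is that encoding non-ASCII text in a source file puts you in the realm of compiler-specific behaviour. In Visual Studio you can actually specify the encoding of the source file (under File -> Advanced Save Options). In your case, the text comes out differently from what you expect because it is (most likely) encoded in UTF-8 in the source file, but, as mentioned, console output is done using MBCS encoding under your currently selected code page, which is not UTF-8. Historically, the advice was to avoid any non-ASCII characters in source files and to escape them using the \x notation. Today there are the C++11 string prefixes and suffixes, which guarantee various encoding forms. You could try using those if you need this ability. I have no practical experience with them, so I cannot advise whether there are any issues with that approach.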
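As a rough illustration of what the C++11 prefixes guarantee, the sketch below counts the code units the compiler stores for the same word under the u8, u and U prefixes. The helper codeUnitCount is hypothetical, and the non-ASCII character is written as the universal character name \u00e9 so that the result does not depend on how the source file itself is saved.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t codeUnitCount(const CharT (&)[N])
{
    return N - 1;
}

// u8"..." is UTF-8: U+00E9 takes two code units, so "café" takes five
static_assert(codeUnitCount(u8"caf\u00e9") == 5, "UTF-8 literal");

// u"..." is UTF-16: one code unit per character here
static_assert(codeUnitCount(u"caf\u00e9") == 4, "UTF-16 literal");

// U"..." is UTF-32: always one code unit per code point
static_assert(codeUnitCount(U"caf\u00e9") == 4, "UTF-32 literal");

// A code point above U+FFFF needs a surrogate pair in UTF-16
static_assert(codeUnitCount(u"\U0001F600") == 2, "surrogate pair");
```

Note that these prefixes fix the encoding of the literal in memory only; they do not change what the console does with those bytes when they are written to it.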

Answer 2

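The problem originates with Windows itself. It uses one character encoding (UTF-16) for most internal operations, another (Windows-1252) as the default file encoding, and yet another (code page 850) for console I/O. Your source file is encoded in Windows-1252, where é is the single byte '\xe9'. When you display that same byte in code page 850, it becomes Ú. Using u8"é" produces the two-byte sequence "\xc3\xa9", which prints on the console as ├®.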
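The byte-level mix-up described here can be reproduced without a Windows console. The sketch below uses a hypothetical helper, showAsCp850, that maps only the handful of code page 850 bytes mentioned in this answer to the characters a CP850 console would display (returned as UTF-8 text); a real table would cover all 256 byte values.

```cpp
#include <string>

// Hypothetical helper: what a console using code page 850 would display for
// the given bytes, returned as UTF-8 text.
// Assumption: only the bytes discussed in this answer are mapped; every other
// non-ASCII byte is shown as '?'.
std::string showAsCp850(const std::string& bytes)
{
    std::string shown;
    for (unsigned char b : bytes)
    {
        if (b < 0x80)
            shown.push_back(static_cast<char>(b)); // ASCII is the same in CP850
        else if (b == 0x82)
            shown += "\xc3\xa9";                   // U+00E9, e with acute accent
        else if (b == 0xA9)
            shown += "\xc2\xae";                   // U+00AE, registered sign
        else if (b == 0xC3)
            shown += "\xe2\x94\x9c";               // U+251C, box drawing character
        else if (b == 0xE9)
            shown += "\xc3\x9a";                   // U+00DA, U with acute accent
        else
            shown.push_back('?');                  // not covered by this sketch
    }
    return shown;
}
```

Feeding it the Windows-1252 bytes "caf\xe9" gives "cafÚ", the UTF-8 bytes "caf\xc3\xa9" give "caf├®", and only the CP850 byte in "caf\x82" comes out as "café".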

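The easiest solution is probably to avoid putting non-ASCII literals in your code and to use the hex code of the character you need. This is not a pretty or portable solution, though.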

std::string s="caf\x82";

A better solution would be to use a u16 string and encode it with WideCharToMultiByte.

Answer 3

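Which characters does C++ support?

The C++ standard does not specify which characters are supported. It is implementation-specific.

Is it related to...

... C++?

No.

... my IDE?

No, although the IDE may have an option to edit source files in a particular encoding.

... my operating system?

It may have an influence.

This is affected by several things.

What the encoding of the source file is.
What encoding the compiler uses to interpret the source file.
- Is it the same as the file's encoding, or different? (It should be the same, or it may not work correctly.)
- The native encoding of the operating system may influence which character encoding the compiler assumes by default.
What encoding is supported by the terminal that runs the program.
- Is it the same as the file's encoding, or different? (It should be the same, or it may not work correctly without conversion.)
Whether the character encoding used is wide. By wide, I mean that the width of a code unit is greater than CHAR_BIT. Since you use narrow string literals and narrow stream operators, a wide source/compiler encoding would cause a conversion to another, narrow encoding. In that case you need to find out both the native narrow and the native wide character encoding that the compiler expects. The compiler converts the input string into the narrow encoding. If the narrow encoding has no representation for a character in the input encoding, it may not work correctly.

An example:

The source file is encoded in UTF-8. The compiler expects UTF-8. The terminal expects UTF-8. In this case, what you see is what you get.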
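To find out which narrow encoding actually ended up in your executable, it can help to dump the bytes of a literal instead of printing it. This is a minimal sketch; the helper name hexBytes is hypothetical.

```cpp
#include <string>

// Hypothetical helper: render each byte of a narrow string as two lowercase
// hex digits, separated by spaces.
std::string hexBytes(const std::string& text)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : text)
    {
        if (!out.empty())
            out.push_back(' ');
        out.push_back(digits[b >> 4]);   // high nibble
        out.push_back(digits[b & 0x0F]); // low nibble
    }
    return out;
}
```

If a literal containing é comes out ending in c3 a9, the compiler stored it as UTF-8; if it ends in a single e9, it was stored in a single-byte encoding such as Windows-1252. Whether the terminal then displays those bytes correctly is a separate question, as the list above points out.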

Answer 4

The trick here is setlocale. With it, the output in the Windows 10 command prompt is correct even without changing the terminal code page.
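A minimal sketch of what a setlocale-based approach could look like is shown below. The helper printWithLocale is hypothetical, and which locale name to pass is an assumption: an empty string asks the C runtime for the user's default locale, and whether that produces correct console output depends on the compiler, the source file encoding and the system configuration.

```cpp
#include <clocale>
#include <iostream>
#include <string>

// Hypothetical helper: switch the C runtime locale, then print the text.
// Returns false, without printing, if the runtime rejects the locale name.
// Assumption: an empty name ("") selects the user's default locale.
bool printWithLocale(const char* localeName, const std::string& text)
{
    if (std::setlocale(LC_ALL, localeName) == nullptr)
        return false;

    std::cout << text << std::endl;
    return true;
}
```

A call such as printWithLocale("", "café") would be the natural thing to try first; if the output is still wrong, the encoding of the source file and of the literal are the next things to check.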
