Question

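I'm sure I'm missing something here. I'm comparing the contents of a regular string literal (in a UTF-8 encoded source file) with a u8 string literal. On Windows, the u8 literal does not contain the UTF-8 encoded data I expect, while on Linux it does.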

Details:

The cpp file is UTF-8 encoded
C++17 is enabled
Compiled on Windows with VS 2019
Compiled on Linux with gcc 9.2.1

The code is as follows:

#include <iostream>
#include <string>

struct HexCharStruct {
    unsigned char c;
    HexCharStruct(unsigned char _c) : c(_c) { }
};

inline std::ostream& operator<<(std::ostream& o, const HexCharStruct& hs) {
    return (o << std::hex << (int)hs.c);
}

inline HexCharStruct hex(unsigned char _c) {
    return HexCharStruct(_c);
}

int main( int argc, char** argv ) {

    std::string s1 = "🎂";    // plain narrow literal
    std::string s2 = u8"🎂";  // UTF-8 literal

    std::cout << "s1: ";
    for (const char& c : s1)
        std::cout << hex(c) << " ";
    std::cout << "\ns2: ";
    for (const char& c : s2)
        std::cout << hex(c) << " ";

    return 0;
}

These are the hex values printed for s1 and s2 on Windows and Linux when I run this:

s1 (Windows): f0 9f 8e 82
s1 (Linux): f0 9f 8e 82
s2 (Windows): c3 b0 c5 b8 c5 bd e2 80 9a
s2 (Linux): f0 9f 8e 82

The UTF-8 hex value of 🎂 (U+1F382) is f0 9f 8e 82, so everything is as expected except s2 on Windows. Can anyone explain this?

Answer 1

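When a source file has no byte-order mark, the Microsoft compiler assumes it is ANSI-encoded, and which ANSI code page that means depends on the localized version of Windows in use. On US and Western European Windows, the assumed encoding is Windows-1252.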

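When the compiler assumes Windows-1252, it decodes the UTF-8 bytes in the source with the wrong encoding, sees four Windows-1252 characters, and then encodes those four characters as UTF-8 for the u8 literal. The plain literal s1 is unaffected because its bytes are decoded and re-encoded with the same code page, so they pass through unchanged. A quick demonstration (Python):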

>>> '🎂'.encode('utf8') # bytes in the file
b'\xf0\x9f\x8e\x82'
>>> b'\xf0\x9f\x8e\x82'.decode('Windows-1252') # What the compiler reads.
'ðŸŽ‚'
>>> 'ðŸŽ‚'.encode('utf8') # What the compiler generates for u8 string.
b'\xc3\xb0\xc5\xb8\xc5\xbd\xe2\x80\x9a'
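The same decode-then-re-encode step can be sketched in C++, the language of the question. This is only an illustration of the mechanism described above, not what the compiler literally runs: the names `kCp1252High`, `reencode_as_utf8`, and `to_hex` are hypothetical, and mapping the five bytes that Windows-1252 leaves undefined to same-valued control characters is an assumption of the sketch.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. The five bytes the code page
// leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to the C1 control
// character with the same value; that choice is an assumption of this sketch.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
};

// Treat every input byte as one Windows-1252 character and encode the
// resulting code point as UTF-8 (all code points here fit in 3 bytes).
std::string reencode_as_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        std::uint16_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Format bytes as space-separated, two-digit lowercase hex.
std::string to_hex(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", b);
        out += buf;
    }
    return out;
}
```

Calling `to_hex(reencode_as_utf8("\xf0\x9f\x8e\x82"))` yields the nine bytes the question reports for s2 on Windows.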

To use UTF-8 source, there are two options: save the source as UTF-8 with a BOM, or add the /utf-8 compiler switch.
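Whichever option is used, a compile-time check can confirm that the compiler really read the file as UTF-8. The sketch below is a suggestion rather than part of the original answer. It also shows a third possibility not mentioned above: spelling the character as a universal character name, so the literal no longer depends on the source encoding at all. The helper name `cake_utf8` is hypothetical.

```cpp
#include <string>

// If the compiler decodes this file with the wrong code page, the literal no
// longer holds 4 UTF-8 bytes plus the terminating null, and the build stops.
static_assert(sizeof(u8"🎂") == 5,
              "source not read as UTF-8; save with a BOM or compile with /utf-8");

// Hypothetical helper: U+1F382 written as a universal character name, so the
// literal is pure ASCII in the source file. The cast keeps the code valid in
// C++20 as well, where u8 literals have type const char8_t[].
inline std::string cake_utf8() {
    return std::string(reinterpret_cast<const char*>(u8"\U0001F382"));
}
```

With the Visual Studio command-line compiler, the switch is passed as, for example, `cl /utf-8 /std:c++17 main.cpp`. On a Windows-1252 system without the switch or a BOM, the assertion would be expected to fail, since the mis-decoded literal is 10 bytes long.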
