Question

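So, I'm writing a program that converts a Chinese-English definition .txt file into a vocab trainer that runs from the CLI. However, on Windows, when I compile it in VS2017, the output turns into gibberish and I'm not sure why. I think it works fine on Linux, but Windows seems to mangle it badly. Does this have something to do with the encoding tables in Windows? Am I missing something? I wrote the code and the input file on Linux, but I also tried typing the characters with the Windows IME and got the same result. I think the picture explains it best. Thanks.

Note: Added samples of the input and output as they appear on Windows, as requested. Also, the input is UTF-8.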

Sample input

人(rén),person
刀(dāo),knife
力(lì),power
又(yòu),right hand; again
口(kǒu),mouth

Sample output

Σ║║(r├⌐n),person
σêÇ(d─üo),knife
σè¢(l├¼),power
σÅê(y├▓u),right hand; again
σÅú(k╟Æu),mouth
σ£ƒ(t╟ö),earth

Picture of Input file & Output

Answer 1

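TL;DR: The Windows terminal hates Unicode. You can work around it, but it's not pretty.

The problem you're having here has nothing to do with "char versus wchar_t". In fact, there is nothing wrong with your program! The problem only arises when the text leaves through cout and reaches the terminal.

You're probably used to thinking of a char as a "character"; this is a common (but understandable) misconception. In C/C++, the char type is usually synonymous with an 8-bit integer, so it is more accurately described as a byte.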

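Your text file chineseVocab.txt is encoded as UTF-8. When you read this file through an fstream, what you get is a string of UTF-8-encoded bytes.

There is no such thing as a "character" in I/O; you are always transferring bytes in a particular encoding. In your example, you are reading UTF-8-encoded bytes from a file handle (fin).

Try running this, and you will see identical results on both platforms (Windows and Linux):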

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
    {
        cout << "Number of bytes in the line: " << dec << line.length() << endl;
        cout << "    ";
        for (char c : line)
        {
            // Here we need to trick the compiler into displaying this "char" as an integer:
            unsigned int byte = (unsigned char)c;
            cout << hex << byte << "  ";
        }
        cout << endl;
        cout << endl;
    }
    return 0;
}

Here's what I see on Windows:

Number of bytes in the line: 16
    e4  ba  ba  28  72  c3  a9  6e  29  2c  70  65  72  73  6f  6e

Number of bytes in the line: 15
    e5  88  80  28  64  c4  81  6f  29  2c  6b  6e  69  66  65

Number of bytes in the line: 14
    e5  8a  9b  28  6c  c3  ac  29  2c  70  6f  77  65  72

Number of bytes in the line: 27
    e5  8f  88  28  79  c3  b2  75  29  2c  72  69  67  68  74  20  68  61  6e  64  3b  20  61  67  61  69  6e

Number of bytes in the line: 15
    e5  8f  a3  28  6b  c7  92  75  29  2c  6d  6f  75  74  68

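So far, so good.

Here's where the problem starts: you're writing those same UTF-8-encoded bytes to another file handle (cout).

The cout file handle is connected to your CLI (the "terminal", the "console", the "shell", whatever you want to call it). The CLI reads bytes from cout and decodes them into characters so that it can display them.

Linux terminals are usually configured to use a UTF-8 decoder. Good news! Your bytes are encoded as UTF-8, so your Linux terminal's decoder matches the text file's encoding. That's why everything looks fine in the terminal.
Windows terminals are usually configured to use a system-dependent decoder (yours appears to be DOS code page 437). Bad news! Your bytes are encoded as UTF-8, so your Windows terminal's decoder does not match the text file's encoding. That's why everything looks garbled in the terminal.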

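OK, so how do you fix it? Unfortunately, I couldn't find any portable way to do it... You will need to fork your program into a Linux version and a Windows version. In the Windows version:

Convert your UTF-8 bytes into UTF-16 code units.
Set standard output to UTF-16 mode.
Write to wcout instead of cout.
Tell your users to change their terminal to a font that supports Chinese characters.

Here's the code: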

#include <fstream>
#include <iostream>
#include <string>

#include <windows.h>

#include <fcntl.h>  
#include <io.h>  
#include <stdio.h> 

using namespace std;

// Based on this article:
// https://msdn.microsoft.com/magazine/mt763237?f=255&MSPPError=-2147217396
wstring utf16FromUtf8(const string & utf8)
{
    std::wstring utf16;

    // Empty input --> empty output
    if (utf8.length() == 0)
        return utf16;

    // Reject the string if its bytes do not constitute valid UTF-8
    constexpr DWORD kFlags = MB_ERR_INVALID_CHARS;

    // Compute how many 16-bit code units are needed to store this string:
    const int nCodeUnits = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        static_cast<int>(utf8.length()), // Length of the source UTF-8 string, in bytes
        nullptr,       // Unused - no conversion done in this step
        0              // Request size of destination buffer, in wchar_ts
    );

    // Invalid UTF-8 detected? Return empty string:
    if (!nCodeUnits)
        return utf16;

    // Allocate space for the UTF-16 code units:
    utf16.resize(nCodeUnits);

    // Convert from UTF-8 to UTF-16
    const int result = ::MultiByteToWideChar(
        CP_UTF8,       // Source string is in UTF-8
        kFlags,        // Conversion flags
        utf8.data(),   // Source UTF-8 string pointer
        static_cast<int>(utf8.length()), // Length of source UTF-8 string, in bytes
        &utf16[0],     // Pointer to destination buffer
        nCodeUnits     // Size of destination buffer, in code units
    );

    // Conversion failed? Return empty string, as above:
    if (!result)
        utf16.clear();

    return utf16;
}

int main()
{
    // Based on this article:
    // https://blogs.msmvps.com/gdicanio/2017/08/22/printing-utf-8-text-to-the-windows-console/
    _setmode(_fileno(stdout), _O_U16TEXT);

    fstream fin("chineseVocab.txt");
    string line;
    while (getline(fin, line))
        wcout << utf16FromUtf8(line) << endl;
    return 0;
}

In my terminal, it mostly looks fine after changing the font to MS Gothic:

Some characters are still messed up, but that's because the font doesn't support them.
