Question

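I'm trying to help a friend with a project that was supposed to take 1 hour and has now taken 3 days. Needless to say, I'm feeling very frustrated and angry ;-) ooooouuuu... I breathe.

The program, written in C++, just reads a bunch of files and processes them. The problem is that the files it reads are encoded in UTF-16 (because they contain text written in different languages), and a simple use of ifstream doesn't seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files are UTF-16.

I have now spent the entire afternoon on the web trying to find information about READING UTF-16 files and converting the content of a UTF-16 line to char! I just can't seem to do it. It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, etc., none of which I have used before (I specialise in graphics applications, not desktop applications). I can't get it to work.

This is what I've done so far (it doesn't work):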

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

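This is the most I could come up with, but it doesn't even work. It doesn't do anything better. The real problem is that I don't understand what I'm doing in the first place.

Please help! It is really driving me crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++), and this code needs -stdlib=libc++, which gcc on his side doesn't seem to support (even though he uses a fairly recent version of gcc, 4.6.3 I believe). So I'm not even sure that using codecvt and locale is a good idea (as in "possible"). Would there be a better (or another) option?

If I just convert all the files to UTF-8 from the command line (using a Linux command), am I going to potentially lose information?

Thanks a lot. I will be forever grateful if you can help me.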

Answer 1

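If I just convert all the files to UTF-8 from the command line (using a Linux command), am I going to potentially lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.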
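As an illustration of the lossless claim, here is a minimal sketch that converts a UTF-16 string to UTF-8 and back using `std::wstring_convert` with `std::codecvt_utf8_utf16`. The function names `utf16_to_utf8` and `utf8_to_utf16` are made up for this example, and the sketch assumes a standard library that ships `<codecvt>` (these facilities are deprecated since C++17, and older libstdc++ releases lack them).

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert UTF-16 code units to a UTF-8 byte string.
std::string utf16_to_utf8(const std::u16string &in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(in);
}

// Hypothetical helper: convert a UTF-8 byte string back to UTF-16 code units.
std::u16string utf8_to_utf16(const std::string &in)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(in);
}
```

Feeding the result of `utf16_to_utf8` back into `utf8_to_utf16` should give back the original code units, including surrogate pairs, which is what "lossless" means here.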

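When wide characters were introduced, they were intended to be a text representation used exclusively inside a program, and never written to disk as wide characters. The wide streams reflect this: they convert the wide characters you write into narrow characters in the output file, and convert the narrow characters in a file into wide characters in memory when reading.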

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ASCII in the file is converted to wide characters.

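Of course, the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing, and from char to wchar_t when reading.

However, since some people started writing files out in UTF-16, other people have just had to deal with it. The way they do that with C++ streams is by creating codecvt facets that treat char as holding half of a UTF-16 code unit, which is what codecvt_utf16 does.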
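To make the "half a code unit" idea concrete, here is a minimal sketch that pairs bytes into 16-bit code units by hand, without any facet. The name `bytes_to_utf16_units` is hypothetical, and the sketch assumes that input without a BOM is big-endian (the same default `codecvt_utf16` uses when `little_endian` is not requested).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: pair raw bytes into UTF-16 code units.
// A leading BOM (FF FE = little-endian, FE FF = big-endian) is consumed;
// without a BOM this sketch assumes big-endian.
std::vector<std::uint16_t> bytes_to_utf16_units(const std::string &bytes)
{
    std::size_t pos = 0;
    bool little_endian = false;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            little_endian = true;
            pos = 2;
        } else if (b0 == 0xFE && b1 == 0xFF) {
            pos = 2;
        }
    }

    std::vector<std::uint16_t> units;
    // Each code unit is built from two chars; a trailing odd byte is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned first = static_cast<unsigned char>(bytes[pos]);
        const unsigned second = static_cast<unsigned char>(bytes[pos + 1]);
        units.push_back(static_cast<std::uint16_t>(
            little_endian ? (second << 8) | first : (first << 8) | second));
    }
    return units;
}
```

The input would typically be the whole file read in binary mode into a `std::string`. The resulting code units still need surrogate pairs combined before they are real code points.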

With that explanation, here are the problems with your code:

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here is one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}

Answer 2

I adapted, corrected and tested Mats Petersson's impressive solution.

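// Decode one UTF-16 sequence (a single unit or a surrogate pair).
int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // Note the parentheses: != binds tighter than &.
    if ((t & 0xFC00) != 0xD800)
    {
        return t;   // not a lead surrogate: the unit is the code point
    }
    // Combine the low 10 bits of the lead and trail surrogates.
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}



#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
// UTF16 and UTF32 are assumed to be 16-bit and 32-bit unsigned integer
// typedefs defined elsewhere; the input is assumed to be in native byte order.
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        std::vector<int> vec;
        vec.push_back(uc);
        // A lead surrogate (D800-DBFF) is followed by a trail surrogate
        // (DC00-DFFF); 0 is used if the input ends in the middle of a pair.
        const bool has_trail = (uc & 0xFC00) == 0xD800 && input < end;
        vec.push_back(has_trail ? *input++ : 0);
        *output++ = utf16_to_utf32(vec);
    }
}
#ifdef __cplusplus
}
#endif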

Answer 3

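UTF-8 is capable of representing all valid Unicode characters (code points), which is better than UTF-16 (which covers the first 1.1 million code points). [Although, as the comment explains, there are no valid Unicode code points beyond the 1.1 million value, so UTF-16 is "safe" for all currently available code points, and probably will be for a long time to come, unless we do get extraterrestrial visitors with a very complex written language...]

It does this by using, when necessary, multiple bytes/words to store a single code point (what we would call a character). In UTF-8 this is marked by the highest bit being set: in the first byte of a "multibyte" character the top two bits are set, and in each following byte the top bit is set and the next bit from the top is zero.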
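The bit layout described above can be sketched as a small encoder. This is an illustrative helper, not code from the linked answer: the name `encode_utf8` is hypothetical, and it assumes the caller passes a valid code point (at most 0x10FFFF, not a surrogate).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate (D800-DFFF).
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                 // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+20AC (the euro sign) falls in the three-byte range and encodes as the bytes E2 82 AC.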

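To convert an arbitrary code point to UTF-8, you can use the code in my previous answer. (Yes, that question asks about the reverse of what you are asking for, but the code in my answer covers both directions of the conversion.)

Converting from UTF-16 to an "integer" would be a similar method, except for the length of the input. If you are lucky, you may even get away with not doing that...

UTF-16 uses the range D800-DBFF for the first part, which holds 10 bits of data, and the item that follows is in the range DC00-DFFF, holding the next 10 bits of data.

Code for 16-bit to follow...

Code for 16-bit to 32-bit conversion (I have only tested this a little, but it appears to work OK):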

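#include <cstdlib>
#include <iostream>
#include <vector>

// Encode one code point as one or two UTF-16 code units.
std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000)
    {
        // The surrogate range D800-DFFF is not a valid code point.
        // Note the parentheses: == binds tighter than &.
        if ((charcode & 0xF800) == 0xD800)
        {
            std::cerr << "Error bad character code" << std::endl;
            std::exit(1);
        }
        r.push_back(charcode);
        return r;
    }
    charcode -= 0x10000;
    if (charcode > 0xFFFFF)
    {
        std::cerr << "Error bad character code" << std::endl;
        std::exit(1);
    }
    int coded = 0xD800 | ((charcode >> 10) & 0x3FF);   // lead surrogate
    r.push_back(coded);
    coded = 0xDC00 | (charcode & 0x3FF);               // trail surrogate
    r.push_back(coded);
    return r;
}


// Decode one UTF-16 sequence (a single unit or a surrogate pair).
int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // Note the parentheses: != binds tighter than &.
    if ((t & 0xFC00) != 0xD800)
    {
        return t;   // not a lead surrogate: the unit is the code point
    }
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}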

C++ UTF-16 to char conversion (Linux/Ubuntu)
