Question

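On Linux, I have some code for reading a Unicode file, similar to the code shown below.

However, special characters (such as the Danish letters æ, ø and å) are not handled correctly. For the line 'abcæøåabc', the output is just 'abc'. In the debugger, I can see that the contents of wline are also just a\000b\000c\000.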

#include <fstream>
#include <string>

std::wifstream wif("myfile.txt");
if (wif.is_open())
{
    //set proper position compared to byteorder
    wif.seekg(2, std::ios::beg);
    std::wstring wline;

    while (wif.good())
    {
        std::getline(wif, wline);
        if (!wif.eof())
        {
            std::wstring convert;
            for (auto c : wline)
            {
                if (c != '\0')
                    convert += c;
            }
        }
    }
}
wif.close();

Can anyone tell me how to read the whole line?

Thanks and regards

Answer 1

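You must use the imbue() method to tell the wifstream that the file is encoded as UTF-16, and let it consume the BOM for you. You do not have to seekg() past the BOM manually. For example: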

#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

// open as a byte stream
std::wifstream wif("myfile.txt", std::ios::binary);
if (wif.is_open())
{
    // apply BOM-sensitive UTF-16 facet
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring wline;
    while (std::getline(wif, wline))
    {
        std::wstring convert;
        for (auto c : wline)
        {
            if (c != L'\0')
                convert += c;
        }
    }

    wif.close();
}
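
To try this approach end to end, here is a minimal sketch that wraps the same facet in a function and adds a helper that produces a test file. The names write_sample_utf16le and read_utf16_lines are hypothetical, not part of any library, and the sample file is assumed to be UTF-16LE with a BOM. The sketch imbues the stream before opening the file, which should behave the same as imbuing right after opening. Note that <codecvt> is deprecated since C++17, so your compiler may emit warnings.

```cpp
#include <cassert>   // only needed for assert-based checks by the caller
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: writes the line "abcæøåabc\n" as UTF-16LE with a BOM.
bool write_sample_utf16le(const std::string& path)
{
    const unsigned char bytes[] = {
        0xFF, 0xFE,                 // byte-order mark, little-endian
        'a', 0, 'b', 0, 'c', 0,
        0xE6, 0, 0xF8, 0, 0xE5, 0,  // æ ø å (U+00E6, U+00F8, U+00E5)
        'a', 0, 'b', 0, 'c', 0,
        '\n', 0
    };
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
    return static_cast<bool>(out);
}

// Hypothetical helper: reads every line of a UTF-16 file into wide strings.
// Returns an empty vector if the file cannot be opened.
std::vector<std::wstring> read_utf16_lines(const std::string& path)
{
    std::vector<std::wstring> lines;

    std::wifstream wif;
    // consume_header makes the facet read the BOM and pick the byte order.
    wif.imbue(std::locale(wif.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // Binary mode keeps the stream from altering the raw bytes.
    wif.open(path, std::ios::binary);
    if (!wif.is_open())
        return lines;

    std::wstring wline;
    while (std::getline(wif, wline))
        lines.push_back(wline);
    return lines;
}
```

Calling write_sample_utf16le("sample.txt") and then read_utf16_lines("sample.txt") should give back a single nine-character line containing the Danish letters, with no embedded \0 characters left to filter out.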

Reading a Unicode file with special characters using std::wifstream
