Question

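I am reading a UTF-8 encoded Unicode text file and printing it to the console, but the characters displayed differ from the ones shown in the text editor I used to create the file. Here is my code: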

#define UNICODE

#include <windows.h>
#include <iostream>
#include <fstream>
#include <string>

#include "pugixml.hpp"

using std::ifstream;
using std::ios;
using std::string;
using std::wstring;
using std::wcout;   // wcout is used unqualified below

int main( int argc, char * argv[] )
{
    ifstream oFile;

    try
    {
        string sContent;

        oFile.open ( "../config-sample.xml", ios::in );

        if( oFile.is_open() )
        {
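            wchar_t wsBuffer[128] = { 0 };

            // Test the extraction itself so the last token is not processed twice.
            while( oFile >> sContent )
            {
                // The count is in wide characters, not bytes; one slot is kept for the terminator.
                mbstowcs( wsBuffer, sContent.c_str(), sizeof( wsBuffer ) / sizeof( wsBuffer[0] ) - 1 );
              //wprintf( wsBuffer );// Same result as wcout.
                wcout << wsBuffer;
            }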

            Sleep(100000);
        }
        else
        {
            throw L"Failed to open file";
        }
    }
    catch( const wchar_t * pwsMsg )
    {
        ::MessageBox( NULL, pwsMsg, L"Error", MB_OK | MB_TOPMOST | MB_SETFOREGROUND );
    }

    if( oFile.is_open() )
    {
        oFile.close();
    }

    return 0;
}

I must be missing something about encodings.

Answer 1

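Wide strings do not mean UTF-8. In fact it is quite the opposite: UTF-8 stands for Unicode Transformation Format (8-bit). It is a way of representing Unicode with 8-bit code units, so with your normal char. You should read the file into a normal string (not a wide string).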
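As a rough sketch of that advice, the file can be read byte for byte into a std::string, which then simply holds UTF-8 code units. The helper names read_utf8_file and count_code_points are made up for this illustration, and the code assumes the file is valid UTF-8 without a byte order mark.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: reads the whole file as raw bytes, with no conversion.
// Returns an empty string if the file cannot be opened.
std::string read_utf8_file(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: counts code points in valid UTF-8 by skipping
// continuation bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}
```

The point of the second function is only to show that one character may occupy several char elements, so size() on such a string reports bytes, not characters.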

Wide strings use wchar_t, which is 16 bits on Windows. The operating system uses UTF-16 for its "wide" functions.

On Windows, you can convert a UTF-8 string to UTF-16 with MultiByteToWideChar.

Answer 2

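The problem is that mbstowcs does not actually use UTF-8. It converts using the older-style multibyte encoding of the current locale, which is not compatible with UTF-8 (although it is technically possible [I believe] to define a UTF-8 code page for it, the Windows C runtime has traditionally not offered one).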
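To make the locale dependence concrete, here is a minimal sketch of a wrapper around mbstowcs. The name mb_to_wide_current_locale is hypothetical. The function decodes bytes according to whatever C locale is active, so it only decodes UTF-8 if the program has successfully selected a UTF-8 locale with std::setlocale, and whether such a locale exists depends on the platform and runtime.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts a multibyte string using the CURRENT C locale.
// Returns an empty string if the bytes are invalid in that locale.
std::wstring mb_to_wide_current_locale(const std::string& s)
{
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring out(s.size(), L'\0');
    const std::size_t n = std::mbstowcs(&out[0], s.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        return std::wstring();
    out.resize(n);
    return out;
}
```

Plain ASCII converts the same way in any locale, which is why the question's code appears to work until non-ASCII characters show up.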

If you want to convert UTF-8 to UTF-16, you can use MultiByteToWideChar with the code page set to CP_UTF8.
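A minimal sketch of that call, for Windows only (the block compiles to nothing elsewhere). The wrapper name utf8_to_wide is made up for this example. It uses the usual two-call pattern: first ask for the required length, then convert.

```cpp
#include <cstddef>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: converts a UTF-8 std::string to a UTF-16 std::wstring.
// Returns an empty string if the input is empty or not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call: no output buffer, so the function returns the required length.
    const int needed = ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                             utf8.data(), static_cast<int>(utf8.size()),
                                             nullptr, 0);
    if (needed <= 0)
        return std::wstring();   // GetLastError() has the reason

    // Second call: convert into a buffer of exactly that length.
    std::wstring wide(static_cast<std::size_t>(needed), L'\0');
    ::MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          utf8.data(), static_cast<int>(utf8.size()),
                          &wide[0], needed);
    return wide;
}
#endif
```

Passing the byte length explicitly, instead of -1, means the result does not include a terminating null and embedded nulls are preserved.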

Answer 3

I created a C++ u8char container that holds up to six 8-bit char_t values, stored in a std::vector. It can be converted to a wchar_t or appended to a std::string.

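Check it out here: View UTF-8_String structures on Github

#include "UTF-8_String.h" //header from github link above

iBS::u8str  raw_v;
iBS::readu8file("TestUTF-8File.txt",raw_v);
std::cout<<raw_v.str()<<std::endl;

This is the function in the header above, in the u8char struct, that converts a wchar_t to a uint32_t.

//-----------------------------------------------
    u8char(wchar_t ch):ref(1)
        {
            char temp[MB_LEN_MAX];                    // from <climits>; fits any multibyte character
            std::mbstate_t state = std::mbstate_t();  // must start in the initial conversion state
            // wcrtomb encodes with the current C locale, so this is UTF-8 only under a UTF-8 locale.
            std::size_t ret = std::wcrtomb(temp, ch, &state);
            if (ret == static_cast<std::size_t>(-1))
                ret = 0;                              // ch cannot be represented in this locale
            ref.resize(ret);
            for (std::size_t i=0; i<ret; ++i)
                ref[i]=temp[i];
        };

I added what I call a rem switch to the header, to switch UnicodeInt from uint64_t back to uint32_t. The first one, uint64_t, is the default.

/*  rem switch to change from 32 bit int - 64 bit int
#define UnicodeInt uint32_t
/*/
#define UnicodeInt uint64_t
//*/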
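For comparison, here is a sketch of encoding a code point to UTF-8 by hand, which does not depend on the locale the way wcrtomb does. This is not code from the linked header; encode_utf8 is a hypothetical name. It follows the current definition of UTF-8, which uses at most four bytes per code point (the six-byte form mentioned above comes from the older definition).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the linked header: encodes one Unicode
// code point as UTF-8. Returns an empty string for surrogates and for
// values above U+10FFFF, which are not valid scalar values.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                  // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;                       // surrogate, not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x4E2D) yields the three bytes E4 B8 AD.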

Answer 4

I found that wifstream works very well; even the Visual Studio debugger displays the UTF-8 text correctly (I am reading Traditional Chinese text). The code is from this post:

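#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>

// Reads a UTF-8 encoded file and returns its contents as a wide string.
std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename);
    // The original used std::locale::empty(), an MSVC extension; std::locale() is portable.
    // The locale takes ownership of the facet and deletes it.
    wif.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}

//  usage
int main()
{
    std::wstring wstr2 = readFile("C:\\yourUtf8File.txt");
    std::wcout << wstr2;
    return 0;
}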
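If the UTF-8 bytes are already in memory, the same facet can be used without a file stream. This is only a sketch: utf8_to_wstring is a made-up name, and std::wstring_convert is deprecated since C++17 just like the facet itself. As far as I know, with a 16-bit wchar_t (as on Windows) codecvt_utf8<wchar_t> only covers the Basic Multilingual Plane, so text with characters beyond it would need std::codecvt_utf8_utf16<wchar_t> instead.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decodes a UTF-8 byte string held in memory.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wstring(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}
```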
