Question

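I am writing a web crawler to fetch some Chinese web pages. The fetched files are encoded in UTF-8. I need to read these files and parse them, for example to extract URLs and Chinese characters. However, when I read a file into a std::string variable and print it to the console, the Chinese characters turn into garbage. I applied boost::regex to the std::string variable and could extract all of the URLs, but not the Chinese characters.

How can I solve these problems?

P.S. My .cpp files use the default ANSI encoding, and the OS is Windows 8 (Chinese edition).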

Answer 1

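This code may help (it was compiled with VC++ 2010). I tested it with a UTF-8 file containing non-Latin characters and it seems to work, but I don't know whether it works with Chinese characters. See the documentation for _setmode and codecvt_utf8 for more information.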

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <fcntl.h>
#include <io.h>

using namespace std;    // Sorry for this!

void read_all_lines(const wchar_t *filename)
{
    wifstream wifs;
    wstring txtline;
    int c = 0;

    wifs.open(filename);
    if(!wifs.is_open())
    {
        wcerr << L"Unable to open file" << endl;
        return;
    }
    // We are going to read a UTF-8 file (consume_header skips a leading BOM)
    wifs.imbue(locale(wifs.getloc(), new codecvt_utf8<wchar_t, 0x10ffff, consume_header>()));
    while(getline(wifs, txtline))
        wcout << ++c << L'\t' << txtline << L'\n';
    wcout << endl;
}

int wmain(int argc, wchar_t* argv[])    // wide-character entry point; avoids needing <tchar.h>
{
    // Console output will be UTF-16 characters
    _setmode(_fileno(stdout), _O_U16TEXT);
    if(argc < 2)
    {
        wcerr << L"Filename expected!" << endl;
        return 1;
    }
    read_all_lines(argv[1]);
    return 0;
}

If the Chinese characters do not look as expected, make sure the console is using a font that can display them (that is, not a raster/bitmap font).

Answer 2

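In general: use the wide variants (wstring, wfstream, wcout), set your locale to suit, and put an L in front of string literals. locale::global(locale("")) sets the global locale to match the environment's default; then imbue each stream that should not run with that default. wcout.imbue(locale("Chinese_China.936")) might be Microsoft's name for the terminal's locale. This has always been enough for what I needed; hopefully it works for you too.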

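#include <iostream>
#include <locale>
#include <string>
using namespace std;
int main() {
  // Match the environment's default locale
  locale::global(locale(""));
  wstring word;
  // Echo each whitespace-separated word read from standard input
  while (wcin >> word)
    wcout << word << L'\n';
  wcout << L"好運\n";
}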

Answer 3

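If you need the characters displayed correctly, you can use GNU libiconv. If you only need to process URLs, std::string works fine. The problem is the Windows console's code page, not the string itself. Locale behavior depends on the OS and on the standard C++ library implementation, so I don't recommend relying on it.

Windows' MultiByteToWideChar may help, but check Microsoft's documentation for how the function performs the string conversion.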
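
To illustrate the point that std::string holds UTF-8 bytes intact, here is a rough sketch that pulls both URLs and Chinese text out of a UTF-8 std::string without any locale or console involvement. The helper names extract_urls and extract_cjk_runs are hypothetical, std::regex stands in for boost::regex, the URL pattern is deliberately simplified to ASCII characters, "Chinese" is approximated as the range U+4E00..U+9FFF, and the input is assumed to be well-formed UTF-8 (no validation is done):

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect ASCII http/https URLs from a UTF-8 string.
// Only ASCII URL characters are listed in the class, so any multi-byte
// UTF-8 sequence (for example a Chinese character) ends the match.
std::vector<std::string> extract_urls(const std::string& utf8) {
    static const std::regex url_re(R"(https?://[A-Za-z0-9._~:/?#@!$&*+,;=%-]+)");
    std::vector<std::string> urls;
    for (std::sregex_iterator it(utf8.begin(), utf8.end(), url_re), end; it != end; ++it)
        urls.push_back(it->str());
    return urls;
}

// Hypothetical helper: collect runs of consecutive characters in
// U+4E00..U+9FFF and return each run as a UTF-8 substring.
// Assumes well-formed UTF-8; malformed input is not diagnosed.
std::vector<std::string> extract_cjk_runs(const std::string& utf8) {
    std::vector<std::string> runs;
    std::string current;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;   // number of bytes in this sequence
        char32_t cp = b0;      // decoded code point
        if (b0 >= 0xF0)      { len = 4; cp = b0 & 0x07; }
        else if (b0 >= 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xC0) { len = 2; cp = b0 & 0x1F; }
        if (i + len > utf8.size()) break;  // truncated trailing sequence
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        if (cp >= 0x4E00 && cp <= 0x9FFF) {
            current.append(utf8, i, len);  // keep the original UTF-8 bytes
        } else if (!current.empty()) {
            runs.push_back(current);
            current.clear();
        }
        i += len;
    }
    if (!current.empty()) runs.push_back(current);
    return runs;
}
```

For the UTF-8 bytes of "首页 http://example.com/index.html 和 https://example.org/a?b=1，谢谢", extract_urls returns the two URLs and extract_cjk_runs returns "首页", "和" and "谢谢". Printing those strings on a Windows console is a separate matter that still depends on the console code page, and string literals in an ANSI-encoded source file will not be UTF-8 unless written with byte escapes.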
