Question

I'm trying to read a UTF-8 encoded web page using the WinInet library.

Here is part of my code:

HINTERNET hUrl = ::InternetOpenUrl(hInet, wurl.c_str(),NULL,NULL,NULL,NULL);
    CHAR buffer[65536];
    std::wstring full_content;
    std::wstring read_content;
    DWORD number_of_bytes_read=1;

    while(number_of_bytes_read)
    {
        ::InternetReadFile(hUrl, buffer, 65536, &number_of_bytes_read);
    //  ::InternetReadFileExW(hUrl, &buffersw, IRF_SYNC,NULL);
            //((hUrl,buffer,65536,&number_of_bytes_read);
        read_content.resize(number_of_bytes_read);

        ::MultiByteToWideChar(CP_ACP,MB_COMPOSITE,
                     &buffer[0],number_of_bytes_read,
                     &read_content[0],number_of_bytes_read);
        full_content.append(read_content);
        //readed_content.append(buffer,number_of_bytes_read);
    }

English characters come through correctly, but I get garbage instead of Russian characters. What could be causing this? Thanks in advance.

Answer 1

Your web page is UTF-8, but you are decoding it with the ANSI code page (CP_ACP). Use CP_UTF8 instead.

Answer 2

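Change CP_ACP to CP_UTF8 and MB_COMPOSITE to 0.

From the documentation:

For UTF-8 or code page 54936 (GB18030, starting with Windows Vista), dwFlags must be set to either 0 or MB_ERR_INVALID_CHARS. Otherwise, the function fails with ERROR_INVALID_FLAGS.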

Answer 3

Don't convert at all. Keep the UTF-8 in memory. Convert to UTF-16 only when interacting with Windows API functions.

More information about this approach is at http://utf8everywhere.org.
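
As a minimal sketch of keeping the UTF-8 in memory, the download loop could append raw bytes to a `std::string` and leave any conversion for later. The names `ChunkReader` and `ReadAllUtf8` are hypothetical, and the reader callback is an assumed stand-in for a call such as `InternetReadFile`, so that the sketch does not depend on any particular API. Accumulating bytes first also avoids a subtle problem with converting each chunk separately: a multibyte UTF-8 sequence can be split across two reads.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical reader signature: fills `buffer` with up to `capacity` bytes
// and returns how many were written; 0 means there is no more data.
using ChunkReader = std::function<std::size_t(char* buffer, std::size_t capacity)>;

// Hypothetical helper: collects every chunk into one UTF-8 std::string
// without interpreting or converting the bytes.
std::string ReadAllUtf8(const ChunkReader& read_chunk)
{
    std::string utf8;
    char buffer[4096];

    for (;;) {
        const std::size_t n = read_chunk(buffer, sizeof buffer);
        if (n == 0)
            break;              // end of data
        utf8.append(buffer, n); // raw bytes; split sequences are rejoined here
    }
    return utf8;
}
```

With WinInet, the callback would presumably wrap `::InternetReadFile(hUrl, buffer, capacity, &bytes_read)` and also check its return value. The resulting string can then be converted to UTF-16 in one step, and only at the point where a wide-character Windows API actually needs it.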

Converting a string of multibyte characters to widechar gives unexpected results
