Question

I'm having trouble opening a file whose name contains Unicode characters. I created a file on my desktop containing just a few lines of text.

C:\Users\James\Desktop\你好世界.txt

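Edit: I am using CLion. CLion is passing the arguments in Unicode.

When I paste that string into the Windows Run dialog, it finds the file and opens it.

Interestingly, when I call CommandLineToArgvW, the folder separators come back as doubled L'\\' L'\\':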
L"c:\\\\users\\\\james\\\\desktop\\\\你好世界.txt"

So I wrote a small routine that copies the filename into another wchar_t * and strips the extra slashes. It still doesn't work.

errno == 2 and f == NULL.

size_t filename_max_len = wcslen(filename);

//strip double slashes
wchar_t proper_filename[MAX_PATH + 1];

wchar_t previous = L'\0';
size_t proper_filename_location = 0;

for (size_t x = 0; x < filename_max_len && proper_filename_location < MAX_PATH; ++x)
{
    if(previous == L'\\' && filename[x] == L'\\')
        continue;

    previous = filename[x];
    proper_filename[proper_filename_location++] = filename[x];
}

proper_filename[proper_filename_location] = L'\0';

//Read in binary mode to prevent the C system from screwing with line endings
FILE *f = _wfopen(proper_filename, L"rb");

int le = errno;

if (f == NULL)
{
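    fprintf(stderr, "_wfopen failed: %s\n", strerror(le));

    // errno holds C runtime codes, so compare against ENOENT rather than the Win32 ERROR_FILE_NOT_FOUND
    if (le == ENOENT)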
    {
        return DUST_ERR_FILE_NOT_FOUND;
    }
    else {
        return DUST_ERR_COULD_NOT_OPEN_FILE;
    }
}

Answer 1

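I've figured out the problem. My hunch was correct: CLion appears to be supplying Unicode as the program's input. When I used the Windows Run dialog and passed the path as an argument to my program, I was able to open and process the file without any problems.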

Answer 2

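My first guess is that 228, 189, 160 is the first character of your filename encoded as a UTF-8 byte sequence, because that is what the sequence looks like to me. E4 BD A0 (228, 189, 160) decodes to U+4F60, which is indeed the Unicode code point of the first character.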
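To make the decoding step concrete, here is a minimal sketch of a UTF-8 decoder in C. The helper name utf8_decode_one is hypothetical and not part of any program in this thread, and it does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s (at most len bytes available).
 * Returns the code point, or -1 on malformed or truncated input.
 * Hypothetical helper for illustration only: overlong encodings and
 * surrogate code points are not rejected. */
static long utf8_decode_one(const unsigned char *s, size_t len, size_t *consumed)
{
    size_t n;
    long cp;

    if (len == 0)
        return -1;

    /* The lead byte tells us how many bytes the sequence has. */
    if (s[0] < 0x80)                { n = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    else
        return -1;

    if (len < n)
        return -1;

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (size_t i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    if (consumed)
        *consumed = n;
    return cp;
}
```

Passing the three bytes 228, 189, 160 to this function yields 0x4F60 and reports three bytes consumed.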

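I modified the output portion of main in my sample program here to print each argument as a hex-encoded byte sequence. I copied and pasted your path as the program's argument, and the Chinese characters came out encoded in UTF-8 as: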

E4 BD A0
E5 A5 BD
E4 B8 96
E7 95 8C
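The sample program itself is not reproduced in this thread, so the following is only a sketch of how such a hex dump might be written. The function name bytes_to_hex and its buffer-based interface are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Format the bytes of a NUL-terminated string as space-separated
 * uppercase hex pairs in out. Returns the number of input bytes
 * formatted; stops early if out is too small.
 * Hypothetical helper, not taken from the answer's sample program. */
static size_t bytes_to_hex(const char *s, char *out, size_t out_size)
{
    size_t n = 0;
    size_t pos = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (; s[n] != '\0'; ++n) {
        /* Reserve room for a separator, two hex digits, and the terminator. */
        if (pos + 4 > out_size)
            break;
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                n ? " %02X" : "%02X",
                                (unsigned)(unsigned char)s[n]);
    }
    return n;
}
```

Calling it on each argv[i] in main and printing the buffer would give one hex line per argument, assuming the arguments arrive as UTF-8 bytes.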

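Your comment mentions slightly different numbers (specifically 8211 / U+2013, 8226 / U+2022, and 338 / U+0152). Looking at code page Windows-1252, the bytes 0x96, 0x95, and 0x8C correspond to U+2013, U+2022, and U+0152 respectively. Windows-1250 agrees for 0x96 and 0x95 but maps 0x8C to U+015A. These are the trailing bytes of the UTF-8 sequences above, so my guess is that your original program went wrong somewhere when it encountered Unicode input. (You are using GetCommandLineW and passing its result to CommandLineToArgvW, right?)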
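The following sketch shows how those numbers would arise if individual UTF-8 bytes were interpreted as Windows-1252 characters. Whether that is what actually happened in the original program is only a guess. The function name cp1252_to_unicode is hypothetical, and the five byte values that Windows-1252 leaves undefined return -1 here.

```c
/* Map one Windows-1252 byte to its Unicode code point.
 * Only 0x80..0x9F differ from Latin-1; every other byte maps to itself.
 * Returns -1 for the five bytes the code page leaves undefined.
 * Hypothetical helper for illustration only. */
static long cp1252_to_unicode(unsigned char b)
{
    static const unsigned short high[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 0x80..0x87 */
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      /* 0x88..0x8F */
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, /* 0x90..0x97 */
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  /* 0x98..0x9F */
    };

    if (b < 0x80 || b > 0x9F)
        return b;
    return high[b - 0x80] ? (long)high[b - 0x80] : -1;
}
```

With this table, the bytes 0x96, 0x95, and 0x8C map to 8211, 8226, and 338, matching the values quoted from the comment.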

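Here is a screenshot of my output, edited to highlight the relevant character sequences (the ¥ glyph stands for the \ glyph, because I use code page 932 for cmd.exe):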

[Screenshot: program output with highlighted UTF-8 bytes]

Unable to open a file with Unicode characters in its name when using CLion
