Question

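I'm passing a UTF-32 string literal to GCC, and it complains about an invalid multibyte or wide character.

I tested this with Clang and got the same error message.

I originally wrote this code with MSVC, where it worked fine.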

Here is the assert statement:

assert(utf_string_copy_utf32(&string, U"¿Cómo estás?") == 0);

Here is the declaration:

int utf_string_copy(struct utf_string * a, const char32_t * b);

Here is the compile command:

cc -Wall -Wextra -Werror -Wfatal-errors -g -I ../include -fexec-charset=UTF-32 string-test.c libutf.a -o string-test

Am I to assume that GCC can only recognize Unicode characters through escape sequences?

Or am I misunderstanding how GCC and Clang recognize these characters?

Edit 1

Here is the error message:

string-test.c: In function ‘test_copy’:
string-test.c:46:61: error: converting to execution character set: Invalid or incomplete multibyte or wide character
assert(utf_string_copy_utf32(&string, U"�C�mo est�s?") == 0);

Edit 2

I'm now even more confused, because I tried to recreate the error in a smaller example:

#include <uchar.h>
#include <stdlib.h>
#include <stdio.h>

/* Counts char units (bytes) up to the terminating zero. */
static size_t test_utf8(const char * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

/* Counts char32_t units (code points) up to the terminating zero. */
static size_t test_utf32(const char32_t * in){
    size_t len;
    for (len = 0; in[len]; len++);
    return len;
}

int main(void){
    size_t len;

    len = test_utf8(u8"¿Cómo estás?");
    printf("utf-8 length: %zu\n", len);   /* %zu is the size_t specifier */

    len = test_utf32(U"¿Cómo estás?");
    printf("utf-32 length: %zu\n", len);

    return 0;
}

It prints:

utf-8 length: 15
utf-32 length: 12

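This reaffirms how I originally thought it worked.

So I guess this means the problem is somewhere in the library code I'm using. But I still have no idea what is going on.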

Answer 1

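I figured out the problem.

I made a hex dump of the two string literals (the one that breaks in the original code and the one that works).

Here is the broken string literal (which I wrote on Windows):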

00000000: 5522 bf43 f36d 6f20 6573 74e1 733f 220a  U".C.mo est.s?".

Here is the working string literal (which I wrote on an Ubuntu machine):

00000000: 5522 c2bf 43c3 b36d 6f20 6573 74c3 a173  U"..C..mo est..s
00000010: 3f22 0a                                  ?".

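Although they look exactly the same in a code editor, and both have the U prefix, they are encoded differently in the source file.

I'm not sure which encoding is which, but I've learned from this that the source encoding of a string literal is very, very important.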

Edit 1

As @melpomene pointed out in the comments:

The broken encoding is Windows-1252.

The working encoding is UTF-8.
