Question

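This is my first question, so feel free to criticize or correct me if I have missed an important rule.

I was recently given the task of porting old DOS C code to Linux. Font handling is implemented with bitfonts. I wrote a function that draws the selected glyph if you pass in the correct Unicode value.

However, when I try to cast a char to USHORT (the type the function expects), I get a wrong value whenever the character is outside the ASCII table.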

char* test;
test = "°";

printf("test: %hu\n",(USHORT)test[0]);

The number shown on the console should be 176, but it is 194.

If you use "!" instead, the correct value 33 is displayed. I made sure that char is unsigned by setting the GCC compiler flag

-funsigned-char

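GCC uses UTF-8 as its default encoding. I really don't know where the problem is now.

Do I need to add another flag to the compiler?

Update

With the help of @Kninnug's answer I managed to write code that produces the result I need.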

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <stdint.h>

int main(void)
{
   size_t n = 0, x = 0;
   setlocale(LC_CTYPE, "en_US.utf8");
   mbstate_t state = {0};
   char in[] = "!°水"; // or u8"zß水"
   size_t in_sz = sizeof(in) / sizeof (*in);

   printf("Processing %zu UTF-8 code units: [ ", in_sz);
   for(n = 0; n < in_sz; ++n)
   {
      printf("%#x ", (unsigned char)in[n]);
   }
   puts("]");

   wchar_t out[in_sz];
   char* p_in = in, *end = in + in_sz;
   wchar_t *p_out = out;
   int rc = 0;
   while((rc = mbrtowc(p_out, p_in, end - p_in, &state)) > 0)
   {
       p_in += rc;
       p_out += 1;
   }

   size_t out_sz = p_out - out + 1;
   printf("into %zu wchar_t units: [ ", out_sz);
   for(x = 0; x < out_sz; ++x)
   {
      printf("%u ", (unsigned short)out[x]);
   }
   puts("]");
}

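However, when I run this on the embedded device, the non-ASCII characters get merged into one wchar instead of two as on my computer.

I could use the single-byte cp1252 encoding (which works fine), but I would like to keep using Unicode.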

Answer 1

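A char (signed or unsigned) is a single byte in C¹. (USHORT)test[0] converts only the first byte of test, but the character stored there occupies 2 bytes in UTF-8 (you can check this with strlen, which counts the bytes before the first 0 byte).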

To get the correct code point you need to decode the whole UTF-8 sequence. You can do this with mbrtowc and related functions:

char* test;
test = "°";
int len = strlen(test);

wchar_t code = 0;
mbstate_t state = {0};

// convert up to len bytes in test, and put the result in code
// state is used when there are incomplete sequences: pass it to
// the next call to continue decoding
mbrtowc(&code, test, len, &state); // you should check the return value

// here the cast is needed, since a wchar_t is not (necessarily) a short
printf("test: %hu\n", (USHORT)code);

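Side notes:

If USHORT is 16 bits (as is usually the case), it is not wide enough to cover the full range of code points that UTF-8 can encode, which needs (at least) 21 bits.
Once you have the correct code point, there is no need for a cast when passing it to the drawing function. If the function definition or prototype is visible, the compiler can convert the value by itself.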

¹ The confusing name comes from the time when all the world was English and every ASCII code point fit in one byte. A character was therefore the same thing as a byte.

Getting wrong UTF-8 value by casting char to USHORT

1 answer: