Question

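I can't figure out a way to remove leading zeros. My goal is to create the UTF-8 and UTF-32 versions of each number in a for loop.

For example, with UTF-8, wouldn't I have to remove the leading zeros? Does anyone have a solution for how to pull this off? Basically what I'm asking is: does anyone have a simple way to convert Unicode code points to UTF-8?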

    for (i = 0x0; i < 0xffff; i++) {
        printf("%#x \n", i);
        //convert to UTF8
    }

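So here is an example of what I'm trying to accomplish for each i.

For example: the Unicode value U+0760 (base 16) would convert to UTF-8 as
- in binary: 1101 1101 1010 0000
- in hex: DD A0

Basically I'm trying to do this for every i, converting it to its hex equivalent in UTF-8.

The problem I'm running into is that the process of converting Unicode to UTF-8 seems to involve removing leading 0s from the bit pattern. I'm not sure how to do that dynamically.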

Answer 1

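As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF, inclusive) is encoded in UTF-8 as one to four bytes.

Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are unsigned for a reason (negative code points are not considered at all), and that the function does assume an unsigned int code. The compiler does not care, and probably because of that the practice seems odd and confusing even to long-term members here, so I gave up and stopped trying to include such reminders. :( )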

#include <stddef.h>  /* size_t */

static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
        buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 4;
    }
    return 0;
}

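Supply it with an unsigned char array of four or more chars, and the Unicode code point. The function returns the number of chars needed to encode the code point in UTF-8, and stores those chars in the array. For codes above 0x10FFFF the function returns 0 (nothing encoded), but it does not otherwise check that the Unicode code point is valid. I.e., it is a simple encoder, and all it knows about Unicode is that code points run from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.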

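Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.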
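
If the surrogate range does need to be rejected, a thin wrapper is one option. The following is only a sketch: the name code_to_utf8_strict is made up for illustration, and the encoder is repeated so the snippet compiles on its own. It treats U+D800 through U+DFFF as not encodable, because those values are reserved for UTF-16 surrogates and are not valid code points to encode in UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Same encoder as above, repeated so this snippet is self-contained. */
static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
        buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 4;
    }
    return 0;
}

/* Hypothetical wrapper (not part of the original answer): returns 0 for
   surrogate values U+D800..U+DFFF, otherwise defers to code_to_utf8(). */
static size_t code_to_utf8_strict(unsigned char *const buffer, const unsigned int code)
{
    assert(buffer != NULL);
    if (code >= 0xD800 && code <= 0xDFFF)
        return 0;
    return code_to_utf8(buffer, code);
}
```

A negative argument such as -1 converts to a very large unsigned value, so both functions return 0 for it.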

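You need to write a function that prints out the 8 least significant bits of each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit chars). Then use the function above to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit-printing function for each unsigned char in the array, in increasing order, for as many unsigned chars as the conversion function returned for that code point.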
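
The steps just described could look roughly like the sketch below. The helper names byte_to_bits and print_utf8 are invented for illustration, the output format is an arbitrary choice, and the encoder is repeated only so that the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same encoder as above, repeated so this snippet is self-contained. */
static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);
        buffer[1] = 0x80 | (code & 0x3F);
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);
        buffer[2] = 0x80 | (code & 0x3F);
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);
        buffer[3] = 0x80 | (code & 0x3F);
        return 4;
    }
    return 0;
}

/* Hypothetical helper: writes the 8 least significant bits of c into out
   as '0'/'1' characters, most significant bit first, plus a terminator. */
static void byte_to_bits(const unsigned char c, char out[9])
{
    for (int bit = 7; bit >= 0; bit--)
        out[7 - bit] = ((c >> bit) & 1) ? '1' : '0';
    out[8] = '\0';
}

/* Hypothetical helper: prints one code point in binary and in hex. */
static void print_utf8(FILE *const stream, const unsigned int code)
{
    unsigned char utf8[4];
    char bits[9];
    const size_t len = code_to_utf8(utf8, code);
    size_t i;

    assert(len <= sizeof utf8);

    fprintf(stream, "U+%04X:", code);
    for (i = 0; i < len; i++) {
        byte_to_bits(utf8[i], bits);
        fprintf(stream, " %s", bits);
    }
    fputs(" (", stream);
    for (i = 0; i < len; i++)
        fprintf(stream, "%s%02X", i ? " " : "", (unsigned int)utf8[i]);
    fputs(")\n", stream);
}
```

Calling print_utf8(stdout, 0x0760) prints `U+0760: 11011101 10100000 (DD A0)`, which matches the example in the question. Calling it from the question's loop prints every code point in turn; note that the loop condition `i < 0xffff` stops one short of U+FFFF.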

Answer 2

Converting to UTF-32 is trivial: it is just the Unicode code point.

#include <stdio.h>   /* fprintf, stderr */
#include <wchar.h>   /* wint_t */

wint_t codepoint_to_utf32( const wint_t codepoint ) {
    if( codepoint > 0x10FFFF ) {
        fprintf( stderr, "Codepoint %x is out of UTF-32 range\n", codepoint);
        return -1;
    }

    return codepoint;
}

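Note that I'm using wint_t; the w is for "wide". It's an integer type which is guaranteed to be large enough to hold any wchar_t as well as EOF. wchar_t (wide character) is guaranteed to be wide enough to support all system locales.

Converting to UTF-8 is a bit more complicated because of its codepage layout, which is designed to be compatible with 7-bit ASCII. Some bit shifting is required.

Start with the UTF-8 table.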

U+0000  U+007F    0xxxxxxx
U+0080  U+07FF    110xxxxx  10xxxxxx
U+0800  U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
U+10000 U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx

Turn that into a big if / else if statement.

wint_t codepoint_to_utf8( const wint_t codepoint ) {
    wint_t utf8 = 0;

    // U+0000   U+007F    0xxxxxxx
    if( codepoint <= 0x007F ) {
    }
    // U+0080   U+07FF    110xxxxx  10xxxxxx
    else if( codepoint <= 0x07FF ) {
    }
    // U+0800   U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
    else if( codepoint <= 0xFFFF ) {
    }
    // U+10000  U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
    else if( codepoint <= 0x10FFFF ) {
    }
    else {
        fprintf( stderr, "Codepoint %x is out of UTF-8 range\n", codepoint);
        return -1;
    }

    return utf8;
}

Start filling in the blanks. The first one is easy: it's just the code point.

    // U+0000   U+007F    0xxxxxxx
    if( codepoint <= 0x007F ) {
        utf8 = codepoint;
    }

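To do the next one, we need to apply a bitmask and do some bit shifting. C (before C23) doesn't support binary literals, so I converted the binary to hex using perl -wle 'printf("%x\n", 0b1100000010000000)'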

    // U+0080   U+07FF    110xxxxx  10xxxxxx
    else if( codepoint <= 0x00007FF ) {
        // Start at 1100000010000000
        utf8 = 0xC080;

        // 6 low bits using the bitmask 00111111
        // That fills in the 10xxxxxx part.
        utf8 += codepoint & 0x3f;

        // 5 high bits using the bitmask 11111000000
        // Shift over 2 to jump the hard coded 10 in the low byte.
        // That fills in the 110xxxxx part.
        utf8 += (codepoint & 0x7c0) << 2;
    }

I'll leave the rest to you.

We can test this with various interesting values that touch each piece of the logic.
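
For readers who want to check their own attempt, here is one possible completion of the remaining branches. It is a sketch, not the original author's code. It uses uint32_t instead of wint_t, on the assumption that the four-byte case needs a full 32 bits and wint_t is not guaranteed to be that wide on every platform. The names codepoint_to_utf8_packed and UTF8_PACKED_ERROR are made up, and out-of-range input returns a sentinel instead of printing an error.

```c
#include <stdint.h>

/* Hypothetical error value: the byte 0xFF never occurs in valid UTF-8,
   so this can't collide with a real packed encoding. */
#define UTF8_PACKED_ERROR UINT32_MAX

/* Packs the UTF-8 bytes of codepoint into one integer, first byte in the
   most significant position that is used (e.g. U+2603 gives 0xE29883). */
static uint32_t codepoint_to_utf8_packed(const uint32_t codepoint)
{
    uint32_t utf8;

    /* U+0000   U+007F    0xxxxxxx */
    if (codepoint <= 0x007F) {
        utf8 = codepoint;
    }
    /* U+0080   U+07FF    110xxxxx  10xxxxxx */
    else if (codepoint <= 0x07FF) {
        utf8 = UINT32_C(0xC080);
        utf8 += codepoint & 0x3F;               /* low 6 bits stay in place */
        utf8 += (codepoint & 0x7C0) << 2;       /* 5 high bits jump one "10" */
    }
    /* U+0800   U+FFFF    1110xxxx  10xxxxxx  10xxxxxx */
    else if (codepoint <= 0xFFFF) {
        utf8 = UINT32_C(0xE08080);
        utf8 += codepoint & 0x3F;               /* low 6 bits stay in place */
        utf8 += (codepoint & 0xFC0) << 2;       /* middle 6 bits jump one "10" */
        utf8 += (codepoint & 0xF000) << 4;      /* top 4 bits jump two "10"s */
    }
    /* U+10000  U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx */
    else if (codepoint <= 0x10FFFF) {
        utf8 = UINT32_C(0xF0808080);
        utf8 += codepoint & 0x3F;               /* low 6 bits stay in place */
        utf8 += (codepoint & 0xFC0) << 2;       /* next 6 bits jump one "10" */
        utf8 += (codepoint & 0x3F000) << 4;     /* next 6 bits jump two "10"s */
        utf8 += (codepoint & 0x1C0000) << 6;    /* top 3 bits jump three "10"s */
    }
    else {
        return UTF8_PACKED_ERROR;
    }

    return utf8;
}
```

Each additional continuation byte contributes two fixed bits ("10"), which is why the shift grows by 2 for every 6-bit group further up.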

int main() {    
    // https://codepoints.net/U+0041
    printf("LATIN CAPITAL LETTER A: %x\n", codepoint_to_utf8(0x0041));
    // https://codepoints.net/U+00A2
    printf("Cent sign: %x\n", codepoint_to_utf8(0x00A2));
    // https://codepoints.net/U+2603
    printf("Snowman: %x\n", codepoint_to_utf8(0x02603));
    // https://codepoints.net/U+10160
    printf("GREEK ACROPHONIC TROEZENIAN TEN: %x\n", codepoint_to_utf8(0x10160));

    printf("Out of range: %x\n", codepoint_to_utf8(0x00200000));
}

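This is an interesting exercise, but if you want to do this for real, use a pre-existing library. Gnome Lib (GLib) has Unicode manipulation functions, and a lot more of the pieces missing from C.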

Answer 3

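There are many ways to do this fun exercise of converting a code point to UTF-8.

So as not to give away the whole coding experience, the following is some pseudocode to get the OP started.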

//Original structure
struct
{
    int foo1;
    int foo2;
    int foo3;
} orig_struct = {1,2,3};

//New structure
struct
{
    int bar1;
    int bar2;
} new_struct = {orig_struct.foo1, orig_struct.foo2};
