Question

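In this question: Convert ISO-8859-1 strings to UTF-8 in C/C++

there is a very concise piece of C++ code that converts an ISO-8859-1 string to UTF-8.

In this answer: https://stackoverflow.com/a/4059934/3426514

I'm still a beginner in C++ and I'm struggling to understand how it works. I have read up on the UTF-8 encoding sequences, and I understand that characters < 128 stay the same, and that from 128 upward the first byte gets a prefix and the remaining bits are spread over following bytes that start with 10xx, but I see no bit shifting in this answer.

If someone could help me break it down into a function that handles just one character, that would really help me.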

Answer 1

The code, commented.

This works because Latin-1 0x00 through 0xff map to consecutive UTF-8 code sequences: 0x00-0x7f, 0xc2 0x80-bf, 0xc3 0x80-bf.

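// converting one byte (Latin-1 character) of input per iteration;
// in and out are assumed to be unsigned char * (with plain char, the
// comparisons against 0x80 and 0xbf can fail where char is signed)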
while (*in)
{
    if ( *in < 0x80 )
    {
        // just copy
        *out++ = *in++;
    }
    else
    {
         // first byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
         // (the condition in () evaluates to true / 1)
         *out++ = 0xc2 + ( *in > 0xbf );

         // second byte is the lower six bits of the input byte
         // with the highest bit set (and, implicitly, the second-
         // highest bit unset)
         *out++ = ( *in++ & 0x3f ) + 0x80;
    }
}
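
As for the missing bit shifting: the general rule for a two-byte UTF-8 sequence is lead byte = 0xC0 | (c >> 6) and trail byte = 0x80 | (c & 0x3F). For a Latin-1 byte in 0x80-0xff, c >> 6 can only be 2 or 3, so the lead byte can only be 0xc2 or 0xc3, which is exactly what 0xc2 + ( *in > 0xbf ) computes without a shift. A minimal sketch that checks the two forms against each other; the helper names are made up for this illustration and are not part of the linked answer:

```cpp
#include <cassert>

// General two-byte UTF-8 encoding for a code point c in 0x80..0x7FF:
// the lead byte 110xxxxx carries the upper five bits,
// the trail byte 10xxxxxx carries the lower six bits.
// (Hypothetical helper names, for illustration only.)
unsigned char lead_general(unsigned c)
{
    return static_cast<unsigned char>(0xC0 | (c >> 6));
}

unsigned char trail_general(unsigned c)
{
    return static_cast<unsigned char>(0x80 | (c & 0x3F));
}

// The shortcut from the loop above; valid only for c in 0x80..0xFF.
unsigned char lead_shortcut(unsigned char c)
{
    assert(c >= 0x80);
    return static_cast<unsigned char>(0xC2 + (c > 0xBF));
}

unsigned char trail_shortcut(unsigned char c)
{
    assert(c >= 0x80);
    return static_cast<unsigned char>((c & 0x3F) + 0x80);
}

// Returns true if both forms agree for every Latin-1 byte >= 0x80.
bool shortcut_matches_general()
{
    for (unsigned c = 0x80; c <= 0xFF; ++c)
    {
        const unsigned char b = static_cast<unsigned char>(c);
        if (lead_general(c) != lead_shortcut(b)) return false;
        if (trail_general(c) != trail_shortcut(b)) return false;
    }
    return true;
}
```

For example, é is 0xe9 in Latin-1: it is above 0xbf, so the lead byte is 0xc3, and 0xe9 & 0x3f is 0x29, so the trail byte is 0xa9.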

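The problem with a function that handles a single (input) character is that the output can be either one or two bytes, which makes the function a bit awkward to use. You are usually better off (in both performance and cleanliness of code) processing whole strings.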
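
If you still want the one-character version, here is one possible shape for it: a sketch that returns the number of bytes written, so the caller can deal with the one-or-two-byte result. The function names and signatures are assumptions made for this example, not something taken from the linked answer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Converts one Latin-1 byte to UTF-8, writing 1 or 2 bytes to out.
// Returns the number of bytes written. Hypothetical name and signature;
// out must have room for at least 2 bytes.
std::size_t latin1_char_to_utf8(unsigned char in, unsigned char *out)
{
    assert(out != nullptr);
    if (in < 0x80)
    {
        // ASCII range: copied unchanged
        out[0] = in;
        return 1;
    }
    // lead byte is 0xc2 for 0x80-0xbf, 0xc3 for 0xc0-0xff
    out[0] = static_cast<unsigned char>(0xC2 + (in > 0xBF));
    // trail byte is 10xxxxxx with the lower six bits of the input
    out[1] = static_cast<unsigned char>((in & 0x3F) + 0x80);
    return 2;
}

// Whole-string wrapper built on the single-character function.
std::string latin1_to_utf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 2);   // worst case: every byte becomes two
    for (unsigned char c : in)
    {
        unsigned char buf[2];
        const std::size_t n = latin1_char_to_utf8(c, buf);
        out.append(reinterpret_cast<const char *>(buf), n);
    }
    return out;
}
```

The wrapper shows the awkwardness mentioned above: every call needs a scratch buffer and a length check, which the whole-string loop avoids.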

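Note that the assumption of Latin-1 as the input encoding is very likely to be wrong. For example, Latin-1 does not have the Euro sign (€), or any of the characters ŠšŽžŒœŸ, which makes most people in Europe use either Latin-9 or CP-1252, even if they are not aware of it. ("Encoding? No idea. Latin-1? Yeah, that sounds about right.")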
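
To illustrate why that matters: the byte 0x80 is the control character U+0080 when read as Latin-1, but the Euro sign U+20AC when read as CP-1252, and U+20AC needs three UTF-8 bytes, which the two-byte shortcut cannot produce. A rough sketch, assuming a made-up helper encode_utf8 that only covers code points below U+10000:

```cpp
#include <cassert>
#include <string>

// Encodes a code point below U+10000 as UTF-8 (1 to 3 bytes).
// Hypothetical helper for this illustration; it does not reject
// surrogates and does not handle 4-byte sequences.
std::string encode_utf8(unsigned cp)
{
    assert(cp < 0x10000);
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Byte 0x80 read as Latin-1 is the control character U+0080 ...
std::string byte80_as_latin1() { return encode_utf8(0x0080); }

// ... but read as CP-1252 it is the Euro sign, U+20AC.
std::string byte80_as_cp1252() { return encode_utf8(0x20AC); }
```

The first call yields the bytes c2 80, the second e2 82 ac, so converting CP-1252 input as if it were Latin-1 silently turns every € into an invisible control character.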

All that being said, that is the C way to do it. The C++ way would (probably, hopefully) look more like this:

#include <unicode/unistr.h>
#include <unicode/bytestream.h>

// ...

icu::UnicodeString ustr( in, "ISO-8859-1" );

// ...work with a properly Unicode-aware string class...

// ...convert to UTF-8 if necessary.
char buffer[ BUFSIZE ];
icu::CheckedArrayByteSink bs( buffer, BUFSIZE );
ustr.toUTF8( bs );

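That is using the International Components for Unicode (ICU) library. Note that the same code works for a different input encoding: only the encoding name passed to the constructor changes. Different output encodings, iostream operators, character iterators, and even a C API are readily available from the library.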
