Question

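I'm hashing out a class method that converts a UTF-8 character to its corresponding Unicode code point. My candidate prototypes are the following: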

static uint32_t Utf8ToWStr( uint8_t Byte1,        uint8_t Byte2 = 0x00,
                            uint8_t Byte3 = 0x00, uint8_t Byte4 = 0x00,
                            uint8_t Byte5 = 0x00, uint8_t Byte6 = 0x00);

static uint32_t Utf8ToWStr(const std::vector<uint8_t> & Bytes);

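In my application:
- about 90% of the time, Byte1 will be the only non-zero byte;
- about 9% of the time, Byte1 and Byte2 will be the only non-zero bytes;
- less than 1% of the time, Byte1, Byte2 and Byte3 will be the only non-zero bytes.
Byte4, Byte5 and Byte6 will almost always be zero.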

Which prototype should I prefer for speed?

Answer 1

I'd use

// if you want it as simple as possible
typedef uint8_t data_t[6];

or

// if you like C++11
typedef std::array<uint8_t, 6> data_t;

or

// if it should be extensible
typedef struct { uint8_t data[6]; } data_t;

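to make the fixed-length nature of the input data explicit at compile time. It also saves a lot of typing when actually calling the function.

Using a variable-length vector would suggest to me that there could be more data, less data, or no data at all.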
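As a rough illustration of the fixed-size approach, here is a minimal sketch using the `std::array` typedef from above. The function name `Utf8ToCodePoint` is hypothetical, and the decoder assumes well-formed input: it trusts the lead byte and performs no validation at all.

```cpp
#include <array>
#include <cstdint>

typedef std::array<uint8_t, 6> data_t;

// Hypothetical decoder (not from the question): picks the sequence length
// from the lead byte and folds in the continuation bytes. No validation.
static uint32_t Utf8ToCodePoint(const data_t& b) {
    const uint8_t lead = b[0];
    if (lead < 0x80) return lead;               // 1 byte: 0xxxxxxx (the common case)

    int extra;      // number of continuation bytes that follow the lead byte
    uint32_t cp;    // payload bits taken from the lead byte
    if (lead < 0xE0)      { extra = 1; cp = lead & 0x1F; }  // 110xxxxx
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }  // 1110xxxx
    else if (lead < 0xF8) { extra = 3; cp = lead & 0x07; }  // 11110xxx
    else if (lead < 0xFC) { extra = 4; cp = lead & 0x03; }  // legacy 5-byte form
    else                  { extra = 5; cp = lead & 0x01; }  // legacy 6-byte form

    for (int i = 1; i <= extra; ++i)
        cp = (cp << 6) | (b[i] & 0x3F);         // each continuation byte carries 6 bits
    return cp;
}
```

A call such as `Utf8ToCodePoint(data_t{{0xE2, 0x82, 0xAC}})` zero-fills the unused trailing bytes, so the caller only spells out the bytes that matter.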

Answer 2

Probably neither.

Think about the code calling this function: it would probably have to jump through massive hoops to use it:

uint8_t c1 = *cursor++;
uint8_t c2 = 0;
uint8_t c3 = 0;
uint8_t c4 = 0;
uint8_t c5 = 0;
uint8_t c6 = 0;
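if(c1 >= 0xc0)      // 110xxxxx or above: at least 2 bytes
    c2 = *cursor++;
if(c1 >= 0xe0)      // 1110xxxx or above: at least 3 bytes
    c3 = *cursor++;
if(c1 >= 0xf0)      // 11110xxx or above: at least 4 bytes
    c4 = *cursor++;
if(c1 >= 0xf8)      // legacy 5-byte form
    c5 = *cursor++;
if(c1 >= 0xfc)      // legacy 6-byte form
    c6 = *cursor++;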
uint32_t wch = Utf8ToWStr(c1, c2, c3, c4, c5, c6);

I sincerely doubt that this interface is useful.

The usual interface for conversion routines is

bool utf8_to_wchar(uint8_t const *&cursor, uint8_t const *end, uint32_t &result);

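The return value is used to communicate errors (for example, how would your function react to the parameters (0x81, 0x00)?).

Last but not least, you probably want a mode that specifies whether non-normalized (overlong) UTF-8 should give an error. From a security POV, it is a good idea to disallow encoding U+003F as { 0xC0 0xBF }.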
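To make the cursor-based interface concrete, here is one possible sketch of a function with that signature. Everything beyond the signature is an assumption: it always runs in strict mode, accepts only the modern 1 to 4 byte forms, and rejects stray continuation bytes, truncated input, overlong encodings, surrogates and values above U+10FFFF. On failure it leaves `cursor` where it was.

```cpp
#include <cstdint>

// Sketch: decode one code point from [cursor, end).
// On success: stores it in result, advances cursor, returns true.
// On failure: returns false and leaves cursor and result untouched.
bool utf8_to_wchar(uint8_t const *&cursor, uint8_t const *end, uint32_t &result) {
    if (cursor == end) return false;            // nothing to read

    uint8_t const *p = cursor;
    uint8_t const lead = *p++;
    int extra;          // continuation bytes expected
    uint32_t cp;        // code point being assembled
    uint32_t min_cp;    // smallest value that legitimately needs this length

    if (lead < 0x80)      { extra = 0; cp = lead;        min_cp = 0x0; }
    else if (lead < 0xC0) { return false; }     // stray continuation byte, e.g. 0x81
    else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
    else if (lead < 0xF8) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
    else                  { return false; }     // 5- and 6-byte forms rejected here

    if (end - p < extra) return false;          // truncated sequence

    for (int i = 0; i < extra; ++i) {
        uint8_t const c = *p++;
        if ((c & 0xC0) != 0x80) return false;   // not of the form 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);
    }

    if (cp < min_cp) return false;              // overlong, e.g. { 0xC0 0xBF }
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;

    result = cp;
    cursor = p;
    return true;
}
```

The caller then reduces to a loop of the form `while (cursor != end && utf8_to_wchar(cursor, end, wch)) { ... }`, with none of the per-byte bookkeeping shown earlier.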

Answer 3

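std::vector is likely to be slower, because it stores the bytes on the heap and has to allocate memory for them.

You could also just pass a pointer to a byte array, or use std::array if you are using C++11.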

Should I pass a std::vector or a fixed number of arguments?
