Question

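On Linux and Mac, I have a std::string containing UTF-8 characters (some Latin, some non-Latin).

As we know, UTF-8 characters do not have a fixed size, and some of them take more than 1 byte (unlike regular Latin characters, which take exactly one).

The question is: how do I get the character at offset i?

It makes sense to store the character in an int32 data type, but how do I get that character?

For example: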

std::string str = read_utf8_text();
int c_can_be_more_than_one_byte = str[i]; // <-- obviously this code is wrong

It is important to point out that I do not know the size of the character at offset i.

Answer 1

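It's quite simple.

First, you have to understand that you cannot compute the position without iterating over the string (which is obvious for variable-length characters).

Second, remember that a character in UTF-8 can be 1-4 bytes long, and if it occupies more than one byte, every trailing byte has its two most significant bits set to 10. So you just count bytes, ignoring those for which (byte_val & 0xC0) == 0x80.

Unfortunately, I don't have a compiler at hand right now, so beware of possible mistakes in the code: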

int desired_index = 19;
int index = 0;
const char* p = my_str.c_str(); // c_str() returns const char*
while ( *p && index < desired_index ){
  if ( (*p & 0xC0) != 0x80 ) // lead byte: first byte of a character
    index++;
  p++;
}

// p may now point at trailing bytes (2nd-4th) of the previous character, skip them
while ( (*p & 0xC0) == 0x80 )
  p++;

if ( *p ){
  // here p points to the first byte of your desired char
} else {
  // we reached the end of the string while searching
}
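
Since the question also asks how to store the character in an int32, here is a minimal sketch of how the same scan could be followed by a decoding step. The function name code_point_at and the -1 error value are my own choices for illustration, not part of any library. It assumes the string is mostly well-formed UTF-8 and only checks for truncation and bad continuation bytes; it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the Unicode code point of the character at
// 0-based character index `index`, or -1 if the index is out of range or the
// byte sequence at that position is truncated or malformed.
std::int32_t code_point_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    std::size_t seen = 0;
    // Find the lead byte of the requested character by counting lead bytes.
    while (pos < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80) {      // not a continuation byte
            if (seen == index) break;
            ++seen;
        }
        ++pos;
    }
    if (pos == s.size()) return -1;    // ran off the end of the string

    // The lead byte tells us how many continuation bytes follow.
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t extra = 0;
    std::int32_t cp = 0;
    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return -1;                    // invalid lead byte

    if (pos + extra >= s.size()) return -1;  // truncated sequence

    // Each continuation byte contributes its low 6 bits.
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char c = static_cast<unsigned char>(s[pos + k]);
        if ((c & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

Each call rescans from the beginning, so it is linear in the string length. If you need many random accesses, converting the whole string to a std::u32string once is probably the better trade-off.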
