Question

With English characters it is easy to extract a single character from a string. For example, the following code should output y:

string my_word = "my_word";  // any string whose second character is 'y'
cout << my_word.at(1);

If I try to do the same with Greek characters, I get a funny character:

string my_word = "λογος";
cout << my_word.at(1);

Output:

�

My question is: what can I do to make .at() or a similar function work?

Thank you very much!

Answer 1

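The problem is complex. Non-Latin characters must be properly encoded, and there are several standards for doing so. The question is which encoding your system is using.

In UTF-8, a character is represented by one or more bytes. The count varies from 1 to 4 bytes, depending on the character. For example, λ is represented by two bytes (hex): CE BB.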
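As a rough illustration of the 1-to-4-byte rule, the length of a UTF-8 sequence can be read off the bit pattern of its first byte. This is only a sketch: the function name utf8_sequence_length is made up for this example, and it does not validate the bytes that follow.

```cpp
#include <cstddef>

// Number of bytes implied by the bit pattern of the first byte of a
// UTF-8 sequence, or 0 if the byte cannot be a first byte at all.
// This is not a validator: it does not reject lead bytes that strict
// UTF-8 forbids (0xC0, 0xC1, 0xF5..0xF7).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx (continuation) or 11111xxx
}
```

For the λ above, the first byte is CE, so the function reports a two-byte sequence; for the ASCII letter y it reports one byte.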

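I don't know which other character encodings provide single-byte characters for Greek letters, but I'm sure there is at least one.

Note that my_word.length() will most likely return 10 rather than 5.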
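To see where the 10 comes from, here is a small sketch that compares the byte count with the number of code points. It assumes the string holds valid UTF-8; the names utf8_code_point_count and greek_word are invented for this example.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is
// not a continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// "λογος" spelled out as UTF-8 bytes, so the example does not depend
// on the encoding of the source file.
const std::string greek_word = "\xCE\xBB\xCE\xBF\xCE\xB3\xCE\xBF\xCF\x82";
```

Here greek_word.size() is 10, because each of the five Greek letters takes two bytes, while utf8_code_point_count(greek_word) is 5.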

Answer 2

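std::string is a sequence of narrow characters (char). However, under a UTF-8 locale many national alphabets use more than one char to encode a single letter. So when you take s.at(0), you get half of a letter, or even less. You should use wide characters: std::wstring instead of std::string, std::wcout instead of std::cout, and L"λογος" as the string literal.

Also, you should set the correct locale with std::locale before printing.

Code example for this case: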

#include <iostream>
#include <string>
#include <locale>

int main(int, char**) {
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wstring s = L"λογος";
    std::wcout << s.at(0) << std::endl;
    return 0;
}
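One caveat with wchar_t is that its width is platform-dependent: it is 16 bits on Windows and 32 bits on most Unix-like systems. A possible alternative, sketched here under the assumption that C++11 is available, is std::u32string, where every element is one code point. The names greek_utf32 and letter_at are made up for this example.

```cpp
#include <cstddef>
#include <string>

// "λογος" as UTF-32, written with universal character names so the
// example does not depend on the encoding of the source file.
const std::u32string greek_utf32 = U"\u03BB\u03BF\u03B3\u03BF\u03C2";

// With one char32_t per code point, at() indexes whole letters.
char32_t letter_at(const std::u32string& s, std::size_t index) {
    return s.at(index);  // throws std::out_of_range if index >= s.size()
}
```

Here greek_utf32.size() is 5 and letter_at(greek_utf32, 0) is U+03BB (λ). The trade-off is that the standard streams do not print char32_t as text, so the result has to be converted back to UTF-8 or to wchar_t before output.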

Answer 3

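As others have said, it depends on your encoding. Once you move to internationalization, an at() function becomes problematic because, for example, Hebrew has vowels written around the character. Not all scripts consist of discrete sequences of glyphs.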
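The same effect shows up with combining accents: "é" can be stored as the letter e followed by U+0301 (combining acute accent), which is two code points but one visible letter. The sketch below only knows about the Combining Diacritical Marks block (U+0300 to U+036F), which is a deliberate simplification; Hebrew points and many other marks live elsewhere, and real code would need the full Unicode segmentation rules. The function names are invented for this example.

```cpp
#include <cstddef>
#include <string>

// True for code points in the Combining Diacritical Marks block
// (U+0300..U+036F). Real text needs the full Unicode rules; this
// covers only that one block, for illustration.
bool is_basic_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count code points that are not in that block, as a rough stand-in
// for "letters the reader sees".
std::size_t count_base_characters(const std::u32string& s) {
    std::size_t count = 0;
    for (char32_t cp : s) {
        if (!is_basic_combining_mark(cp)) ++count;
    }
    return count;
}
```

For the string U"e\u0301", size() is 2 but count_base_characters returns 1, so even a per-code-point at() can hand back a fragment of what the reader sees as one letter.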

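It's generally best to treat strings as atomic, unless you are writing the display or word-processing code itself, in which case you do of course need the individual glyphs. To read UTF-8, take a look at the code in Baby X (a windowing system that has to draw text to the screen), here:

https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c

This is the UTF-8 code. It's quite a large chunk of code, but fundamentally straightforward.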

static const unsigned int offsetsFromUTF8[6] = 
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

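/* Forward declarations: both functions are defined further down. */
int bbx_utf8_skip(const char *utf8);
int bbx_utf8_getch(const char *utf8);

int bbx_isutf8z(const char *str)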
{
  int len = 0;
  int pos = 0;
  int nb;
  int i;
  int ch;

  while(str[len])
    len++;
  while(pos < len && *str)
  {
    nb = bbx_utf8_skip(str);
    if(nb < 1 || nb > 4)
      return 0;
    if(pos + nb > len)
      return 0;
    for(i=1;i<nb;i++)
      if( (str[i] & 0xC0) != 0x80 )
        return 0;
    ch = bbx_utf8_getch(str);
    if(ch < 0x80)
    {
      if(nb != 1)
        return 0;
    }
    else if(ch < 0x800)
    {
      if(nb != 2)
        return 0;
    }
    else if(ch < 0x10000)
    {
      if(nb != 3)
        return 0;
    }
    else if(ch < 0x110000)
    {
      if(nb != 4)
        return 0;
    }
    else
      return 0;   /* beyond U+10FFFF: not a valid code point */
    pos += nb;
    str += nb;    
  }

  return 1;
}

int bbx_utf8_skip(const char *utf8)
{
  return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    ch = 0;
    switch (nb) 
    {
            /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
  char *dest = out;
  if (ch < 0x80) 
  {
     *dest++ = (char)ch;
  }
  else if (ch < 0x800) 
  {
    *dest++ = (ch>>6) | 0xC0;
    *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x10000) 
  {
     *dest++ = (ch>>12) | 0xE0;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x110000) 
  {
     *dest++ = (ch>>18) | 0xF0;
     *dest++ = ((ch>>12) & 0x3F) | 0x80;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else
    return 0;
  return dest - out;
}

int bbx_utf8_charwidth(int ch)
{
    if (ch < 0x80)
    {
        return 1;
    }
    else if (ch < 0x800)
    {
        return 2;
    }
    else if (ch < 0x10000)
    {
        return 3;
    }
    else if (ch < 0x110000)
    {
        return 4;
    }
    else
        return 0;
}

int bbx_utf8_Nchars(const char *utf8)
{
  int answer = 0;

  while(*utf8)
  {
    utf8 += bbx_utf8_skip(utf8);
    answer++;
  }

  return answer;
}
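Applied to the original question, the same byte-skipping idea can be turned into an at()-like helper for std::string. This is a minimal sketch, not part of Baby X: the name utf8_at is made up, it assumes the string holds valid UTF-8, and it returns a code point, which is not always a complete visible letter.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Return the index-th code point of a UTF-8 string as its own
// std::string (1 to 4 bytes). Assumes `s` is valid UTF-8.
std::string utf8_at(const std::string& s, std::size_t index) {
    std::size_t pos = 0;
    while (pos < s.size()) {
        // Find the end of the current code point: skip the lead byte,
        // then every continuation byte (10xxxxxx) that follows it.
        std::size_t end = pos + 1;
        while (end < s.size() &&
               (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
            ++end;
        }
        if (index == 0) return s.substr(pos, end - pos);
        --index;
        pos = end;
    }
    throw std::out_of_range("utf8_at: index past the last code point");
}
```

With my_word holding the UTF-8 bytes of "λογος", utf8_at(my_word, 1) returns the two bytes CE BF, which a UTF-8 terminal should display as ο when written to std::cout. Each call scans from the start of the string, so it is linear in the index rather than constant time like std::string::at().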

Greek characters in C++ strings and the .at() operator

3 answers: