Question

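How can I convert a decimal number, for example 225, to the corresponding Unicode character when printing it? I can convert ASCII characters from decimal to characters like this: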

int a = 97;
char b = a;
cout << b << endl;

It prints the letter "a", but it just prints a question mark when I use the number 225 or any other non-ASCII character.

Answer 1

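First, it is not your C++ program that converts the strings of bytes written to standard output into visible characters; it is your terminal (or, more commonly these days, your terminal emulator). Unfortunately, there is no way to ask the terminal how it expects characters to be encoded, so that needs to be configured into your environment; normally, that is done by setting the appropriate locale environment variables.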

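Like most things that have to do with terminals, the locale configuration system would probably have been designed very differently if it had not grown out of many years of legacy software and hardware, most of which was originally designed without much consideration for niceties like accented letters, syllabaries or ideographs. C'est la vie.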

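Unicode is pretty cool, but it also had to be deployed in the face of the particular history of computer representation of writing systems, which meant making a lot of compromises among the various firmly held but radically contradictory opinions in the software engineering community (dicho sea de paso, a community in which head-butting is rather more common than compromise). The fact that Unicode has eventually become more or less the standard is a testimony to its solid technical foundations and to the perseverance and political skills of its promoters and designers, particularly Mark Davis. I say this despite the fact that it took more than two decades to get to this point.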

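One aspect of this history of negotiation and compromise is that there is more than one way to encode a Unicode string into bits. There are at least three ways, and two of those have two different versions depending on endianness; moreover, each of these encoding systems has its dedicated fans (and, consequently, its dogmatic detractors). In particular, Windows made an early decision to go with a mostly 16-bit encoding, UTF-16, while most Unix(-like) systems use a variable-length 8-to-32-bit encoding, UTF-8. (Technically, UTF-16 is also a 16- or 32-bit encoding, but that is beyond the scope of this rant.)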

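Before Unicode, every country or language used its own idiosyncratic 8-bit encoding (or at least those countries whose languages are written with fewer than 194 characters). Consequently, it made sense to configure the encoding as part of the general configuration of local presentation, like the names of months, the currency symbol, and the character that separates the integer part of a number from its decimal fraction. Now that there is widespread (but still far from universal) convergence on Unicode, it seems odd that locales include the particular flavour of Unicode encoding, given that all flavours can represent the same Unicode strings and that the encoding is usually specific to the software being used rather than to a national idiosyncrasy. But they do, and that is why on my Ubuntu box the environment variable LANG is set to es_ES.UTF-8 and not just es_ES. (Or es_PE, as it should be, except that I keep running into little issues with that locale.) If you are using a Linux system, you might find something similar.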

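In theory, that means that my terminal emulator (konsole, as it happens, but there are various others) expects to see UTF-8 sequences. And indeed, konsole is clever enough to check the locale setting and set up its default encoding to match, but I am free to change the encoding (or the locale settings), and confusion is likely to result.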

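So let's suppose that your locale settings and the encoding used by your terminal are actually in sync, which they should be on a well-configured workstation, and go back to the C++ program. Now the C++ program needs to figure out which encoding it is supposed to use, and then transform from whatever internal representation it uses to that external encoding.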

Fortunately, the C++ standard library should handle that correctly, if you cooperate by:

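Telling the standard library to use the configured locale, instead of the default C locale (that is, only unaccented characters, as per English); and
Using strings and iostreams based on wchar_t (or some other wide character format).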

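If you do that, then in theory you do not need to know either what wchar_t means to your standard library or what a particular bit pattern means to your terminal emulator. So let's try that: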

#include <iostream>
#include <locale>

int main(int argc, char** argv) {
  // std::locale()   is the "global" locale
  // std::locale("") is the locale configured through the locale system
  // At startup, the global locale is set to std::locale("C"), so we need
  // to change that if we want locale-aware functions to use the configured
  // locale.
  // This sets the "global" locale to the default locale.
  std::locale::global(std::locale(""));

  // The various standard io streams were initialized before main started,
  // so they are all configured with the default global locale, std::locale("C").
  // If we want them to behave in a locale-aware manner, including using the
  // hopefully correct encoding for output, we need to "imbue" each iostream
  // with the default locale.
  // We don't have to do all of these in this simple example,
  // but it's probably a good idea.
  std::cin.imbue(std::locale());
  std::cout.imbue(std::locale());
  std::cerr.imbue(std::locale());
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());
  std::wcerr.imbue(std::locale());

  // You can't write a wchar_t to cout, because cout only accepts char. wcout, on the
  // other hand, accepts both wchar_t and char; it will "widen" char. So it's
  // convenient to use wcout:
  std::wcout << "a acute: " << wchar_t(225) << std::endl;
  std::wcout << "pi:      " << wchar_t(960) << std::endl;
  return 0;
}

That works on my system. YMMV. Good luck.

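Small side note: I have run into lots of people who think that wcout automatically writes "wide characters", so that using it will produce UTF-16 or UTF-32 or something. It does not. It produces exactly the same encoding as cout. The difference is not in what it outputs but in what it accepts as input. In fact, it cannot really be different from cout, because both of them are connected to the same OS stream, which can only have one encoding at a time.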

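You might ask why it is necessary to have two different iostreams. Why couldn't cout simply accept wchar_t and std::wstring values? I do not actually have an answer for that, but I suspect it is part of the philosophy of not paying for features you do not need. Or something like that. If you figure it out, let me know.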

Answer 2

If for some reason you want to handle this entirely on your own:

void GetUnicodeChar(unsigned int code, char chars[5]) {
    if (code <= 0x7F) {
        chars[0] = (code & 0x7F); chars[1] = '\0';
    } else if (code <= 0x7FF) {
        // one continuation byte
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xC0 | (code & 0x1F); chars[2] = '\0';
    } else if (code <= 0xFFFF) {
        // two continuation bytes
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xE0 | (code & 0xF); chars[3] = '\0';
    } else if (code <= 0x10FFFF) {
        // three continuation bytes
        chars[3] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[2] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[1] = 0x80 | (code & 0x3F); code = (code >> 6);
        chars[0] = 0xF0 | (code & 0x7); chars[4] = '\0';
    } else {
        // unicode replacement character (U+FFFD), UTF-8 bytes EF BF BD
        chars[0] = 0xEF; chars[1] = 0xBF; chars[2] = 0xBD;
        chars[3] = '\0';
    }
}

And then to use it:

char chars[5];
GetUnicodeChar(225, chars);
cout << chars << endl; // á

GetUnicodeChar(0x03A6, chars);
cout << chars << endl; // Φ

GetUnicodeChar(0x110000, chars);
cout << chars << endl; // �

Note that this is just a standard UTF-8 encoding algorithm, so if your platform does not assume UTF-8, the output might not render correctly. (Thanks, @EmilioGaravaglia)

Decimal to Unicode character in C++