
A C++ ctype facet for UTF-8 in MinGW

Time: 2009-08-30 15:39:50

Tags: c++ utf-8 mingw ctype

In this project, all internal strings are stored as UTF-8. The project has been ported to Linux and Windows, and it now needs a to_lower function.

On POSIX systems I can use std::ctype_byname("ru_RU.UTF-8"). But with g++ (Debian 4.3.4-1), ctype::tolower() does not recognize Russian UTF-8 characters (Latin text is lowercased correctly).

On Windows, when I try to construct std::ctype_byname with the "ru_RU.UTF-8" argument, MinGW's standard library throws the exception "std::runtime_error: locale::facet::_S_create_c_locale name not valid".

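How can I implement or find a std::ctype for UTF-8 on Windows? The project already depends on libiconv (the codecvt facet is based on it), but I don't see an obvious way to implement to_lower with it.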

3 answers:

Answer 0 (score: 3)

Try using STLport:

  Here is a description of how you can use STLport to read/write utf8 files.
utf8 is a way of encoding wide characters. As such, management of encoding in
the C++ Standard library is handled by the codecvt locale facet, which is part
of the ctype category. However, utf8 only describes how encoding must be
performed; it cannot be used to classify characters, so it is not enough info
to know how to generate the whole ctype category facets of a locale
instance.

In C++ it means that the following code will throw an exception to
signal that creation failed:

#include <locale>
// Will throw a std::runtime_error exception.
std::locale loc(".utf8");

For the same reason building a locale with the ctype facets based on
UTF8 is also wrong:

// Will throw a std::runtime_error exception:
std::locale loc(std::locale::classic(), ".utf8", std::locale::ctype);

The only solution to get a locale instance that will handle utf8 encoding
is to specifically signal that the codecvt facet should be based on utf8
encoding:

// Will succeed if there is necessary platform support.
std::locale loc(std::locale::classic(), new std::codecvt_byname<wchar_t, char, mbstate_t>(".utf8"));

  Once you have obtained a locale instance you can inject it in a file stream to
read/write utf8 files:

std::fstream fstr("file.utf8");
fstr.imbue(loc);

You can also access the facet directly to perform utf8 encoding/decoding operations:

typedef std::codecvt<wchar_t, char, mbstate_t> codecvt_t;
const codecvt_t& encoding = std::use_facet<codecvt_t>(loc);

Notes:

1. The dot ('.') is mandatory in front of utf8. This is a POSIX convention, locale
names have the following format:
language[_country[.encoding]]

Ex: 'fr_FR'
    'french'
    'ru_RU.koi8r'

2. utf8 encoding is only supported for the moment under Windows. The less common
utf7 encoding is also supported.

Answer 1 (score: 2)

If all you need is to_lower for Cyrillic characters, you can write the function yourself.

АБВГДЕЖ in UTF8  D0 90 D0 91 D0 92 D0 93 D0 94 D0 95 D0 96 0A
абвгдеж in UTF8  D0 B0 D0 B1 D0 B2 D0 B3 D0 B4 D0 B5 D0 B6 0A

But don't forget that UTF-8 is a multi-byte encoding.
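As a rough illustration of the hand-written approach, here is a minimal C++ sketch that works directly on the UTF-8 bytes. The function name utf8_to_lower_cyrillic is hypothetical, and the sketch assumes valid UTF-8 input. It only covers ASCII A-Z and the Cyrillic capitals U+0400 to U+042F (which includes А-Я and Ё); every other byte is copied through unchanged.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of any library): lowercases ASCII A-Z and
// Cyrillic capitals U+0400..U+042F in a UTF-8 string. Assumes valid UTF-8.
std::string utf8_to_lower_cyrillic(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c >= 'A' && c <= 'Z') {
            // Plain ASCII capital.
            out.push_back(static_cast<char>(c + 0x20));
        } else if (c == 0xD0 && i + 1 < in.size()) {
            // Lead byte 0xD0 starts a two-byte sequence for U+0400..U+043F.
            const unsigned char d = static_cast<unsigned char>(in[i + 1]);
            if (d >= 0x80 && d <= 0x8F) {
                // U+0400..U+040F -> U+0450..U+045F (D1 90..9F), e.g. Ё -> ё.
                out.push_back(static_cast<char>(0xD1));
                out.push_back(static_cast<char>(d + 0x10));
                ++i;
            } else if (d >= 0x90 && d <= 0x9F) {
                // U+0410..U+041F -> U+0430..U+043F (D0 B0..BF), e.g. А -> а.
                out.push_back(static_cast<char>(0xD0));
                out.push_back(static_cast<char>(d + 0x20));
                ++i;
            } else if (d >= 0xA0 && d <= 0xAF) {
                // U+0420..U+042F -> U+0440..U+044F (D1 80..8F), e.g. Я -> я.
                out.push_back(static_cast<char>(0xD1));
                out.push_back(static_cast<char>(d - 0x20));
                ++i;
            } else {
                // Already lowercase (or not a letter we handle): copy the lead
                // byte; the continuation byte is copied on the next iteration.
                out.push_back(static_cast<char>(c));
            }
        } else {
            // Any other byte is copied unchanged.
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

Because some capitals (Р-Я and Ё) move from lead byte D0 to lead byte D1 when lowercased, the mapping has to look at both bytes of each character rather than adjusting single bytes in isolation.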

You could also try converting the string from UTF-8 to wchar_t (using libiconv) and then using Windows-specific functions to implement to_lower.

Answer 2 (score: 0)

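Some STL implementations (for example Apache's STDCXX) ship with several locales of their own. In all other cases, the available locales depend entirely on the system.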

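If one operating system uses the name "ru_RU.UTF-8", that does not mean other systems use the same name for this locale. Debian and Windows may use different names, which is why you get the runtime exception.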

You should install the required locale on the system beforehand, or use an STL that already provides this locale.
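Since the name differs between systems, one option is to probe a list of candidate names at startup and use the first one that the standard library accepts. This is only a sketch: first_available_locale is a hypothetical helper, and the candidate names in the usage note are examples that may or may not exist on a given machine.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first candidate name that std::locale
// accepts on this system, or an empty string if none of them is available.
std::string first_available_locale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        try {
            std::locale loc(name.c_str());  // throws std::runtime_error if unknown
            return name;
        } catch (const std::runtime_error&) {
            // Name not known on this platform; try the next candidate.
        }
    }
    return std::string();
}
```

A call might pass names such as "ru_RU.UTF-8" and "ru_RU.utf8", followed by "C" as a last resort. Finding a usable name only tells you the locale can be constructed; whether its ctype facet lowercases UTF-8 text correctly still has to be checked separately.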

My two cents...