Question

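This sounds like a simple problem, but C++ makes it difficult (at least for me): I have a wstring, and I want to get the first letter as a wchar_t object and then remove that first letter from the string.

This does not work for non-ASCII characters: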

wchar_t currentLetter = word.at(0);

because it returns two characters for characters such as German umlauts (when run in a loop).

This does not work either:

wchar_t currentLetter = word.substr(0,1);

error: no viable conversion from 'std::basic_string<wchar_t>' to 'wchar_t'

Nor does this:

wchar_t currentLetter = word.substr(0,1).c_str();

error: cannot initialize a variable of type 'wchar_t' with an rvalue of type 'const wchar_t *'

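Any other ideas?

Cheers,

Martin

----UPDATE----- Here is some executable code that demonstrates the problem. The program is supposed to iterate over all the letters and output them one by one: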

#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <string>
using namespace std;

int main() {
    wstring word = L"für";
    wcout << word << endl;
    wcout << word.at(1) << " " << word[1] << " " << word.substr(1,1) << endl;

    wchar_t currentLetter;
    bool isLastLetter;

    do {
        isLastLetter = ( word.length() == 1 );
        currentLetter = word.at(0);
        wcout << L"Letter: " << currentLetter << endl;

        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}

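However, the actual output I get is:

F≥Cř
? ? ?
Letter: f
Letter: ?
Letter: r

The source file is encoded in UTF-8, and the console encoding is also set to UTF-8.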

Answer 1

Here is the solution provided by Sehe:

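#include <algorithm>   // std::copy
#include <cstdlib>     // EXIT_SUCCESS
#include <iostream>
#include <iterator>    // std::back_inserter
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>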

using namespace std;

template <typename C>
std::string to_utf8(C const& in)
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);

    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main() {
    wstring word = L"für";

    bool isLastLetter;

    do {
        isLastLetter = ( word.length() == 1 );
        auto currentLetter = to_utf8(word.substr(0, 1));
        cout << "Letter: " << currentLetter << endl;

        word = word.substr(1, word.length()); // remove first letter
    } while (word.length() > 0);

    return EXIT_SUCCESS;
}

Output:

Letter: f

Letter: ü

Letter: r

Yes, you need Boost, but it seems you need an external library anyway.
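Note that the answer keeps the question's loop and only changes how each letter is written out, which suggests the extraction step itself was not the problem. As a minimal sketch of just that step, the first letter can be read and removed in place without building a new string via substr. The helper name take_first_letter is made up for this illustration, and it assumes each letter occupies exactly one wchar_t (which holds for "für" but not for characters outside the BMP where wchar_t is 16 bits wide):

```cpp
#include <string>

// Hypothetical helper: returns the first wchar_t of `word` and removes it.
// Precondition: `word` is not empty.
wchar_t take_first_letter(std::wstring& word)
{
    wchar_t first = word.front();  // same element as word.at(0) or word[0]
    word.erase(0, 1);              // drop one code unit from the front, in place
    return first;
}
```

Called repeatedly while !word.empty(), this visits the same elements as the do/while loops above.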

1

C++ does not know about Unicode. Use an external library such as ICU (the UnicodeString class) or Qt (the QString class); both support Unicode, including UTF-8.

2

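Because UTF-8 is a variable-length encoding, every kind of indexing works in code units, not code points. Random access to the code points of a UTF-8 sequence is not possible because of that variable-length nature. If you want random access, you need a fixed-length encoding such as UTF-32. For that you can use the U prefix on string literals.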
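To make the difference between code units and code points concrete, here is a small sketch. The helper names are invented for this example, and the UTF-8 counter assumes well-formed input:

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumes `utf8` is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: returns the first code point of a UTF-32 string
// and removes it. Precondition: `text` is not empty.
char32_t take_first_code_point(std::u32string& text)
{
    char32_t first = text.front();  // one element is one code point in UTF-32
    text.erase(0, 1);
    return first;
}
```

With std::string s = "f\xC3\xBCr" (the UTF-8 bytes of "für"), s.size() is 4 while count_code_points(s) is 3. With std::u32string t = U"f\u00FCr", t.size() is 3 and t[1] is the ü directly.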

3

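The C++ language standard has no notion of explicit encodings. It only contains an opaque notion of a "system encoding", for which wchar_t is a "sufficiently large" type.

To convert from the opaque system encoding to an explicit external encoding, you must use an external library. The library of choice would be iconv() (from WCHAR_T to UTF-8), which is part of POSIX and available on many platforms, although on Windows the WideCharToMultiByte function is guaranteed to produce UTF-8.

C++11 adds new UTF-8 literals in the form of std::string s = u8"Hello World: \U0010FFFF";. Those are already in UTF-8, but they cannot interface with the opaque wstring other than through the way I described.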
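Purely to illustrate what such a conversion does, the UTF-8 encoding step can be written by hand once the input is known to be Unicode code points. That is exactly the part the standard does not guarantee for wchar_t, so the sketch below states it as an assumption; the function names are invented for this example, and it is not a substitute for iconv or WideCharToMultiByte:

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8 and appends it to `out`.
// Assumption: `cp` is a valid scalar value (at most 0x10FFFF and not a
// surrogate); no validation is performed here.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Assumption: every wchar_t in `in` holds one full Unicode code point.
// This is typical for 32-bit wchar_t on Linux, but not for Windows, where
// wchar_t is 16 bits and holds UTF-16 code units.
std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (wchar_t wc : in) {
        append_utf8(out, static_cast<char32_t>(wc));
    }
    return out;
}
```

Under that assumption, wide_to_utf8(L"f\u00FCr") yields the four bytes 66 C3 BC 72, which a UTF-8 console displays as "für".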

4 (about source files but still sorta relevant)

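Encoding in C++ is quite complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set. These include the common characters listed in §2.2/1 (§2.3/1 in C++11). These characters should all fit into one char. In addition, implementations have to support a way of naming other characters, called universal character names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them is usable in identifiers (listed in Annex E).

This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. This constitutes the encoding used.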
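One practical consequence for the question's L"für" literal: spelling the ü as a universal character name keeps the source file plain ASCII, so the literal no longer depends on how the compiler decodes the file. A minimal sketch (the function name sample_word is made up for this example):

```cpp
#include <string>

// The word "für" written with a universal character name:
// \u00FC names U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
// The bytes of this source text are all ASCII, so the implementation-defined
// mapping from file bytes to source characters does not affect the literal.
std::wstring sample_word()
{
    return L"f\u00FCr";
}
```

On implementations where wchar_t stores Unicode values (the common case, though not required by the standard), sample_word()[1] has the value 0xFC.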

C++: How to get the first letter of a wstring
