Question

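I want to do some simple string manipulation on a UTF-8 text file. That means taking substrings from a line and writing them out in a different order.

My Linux machine uses a UTF-8 locale and I don't intend to run the program anywhere else, so setting the locale to UTF-8 seemed like the way to go. Adapting an example I found, I ended up with the test program below. If you give it a Greek word, it echoes the word correctly, but the result of substr is just garbage. Is there another function I could use, or am I going about using the UTF-8 locale in completely the wrong way?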

    #include <clocale>
    #include <string>
    #include <iostream>

    int main()
    {
        std::string newwd;
        setlocale(LC_ALL, "");
        std::cout << "Enter greek word ";
        std::string wordgr;
        std::getline(std::cin, wordgr);
        std::cout << "The word is " << wordgr << "." << std::endl;
        newwd=wordgr.substr(2,1) ;
        std::cout << "3rd letter is " << wordgr.substr(2,1) << " <" << std::endl;
        return 0;
    }

Answer 1

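UTF-8 is a variable-length encoding: a single character (code point) takes between 1 and 4 bytes (the original design allowed up to 6, but RFC 3629 limits it to 4). std::string::substr() works on bytes, not characters, which is why you get unexpected results. Greek letters in UTF-8 are not single-byte characters. If you enter a 4-letter Greek word and call std::string::length() on it, you will get a result greater than 4 (most likely 8, since a basic Greek letter takes 2 bytes).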
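
To make the byte-versus-character distinction concrete, here is a minimal sketch using only the standard library. The helper name count_code_points is made up for illustration, and it assumes the input is already valid UTF-8. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), since each such byte starts a new code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes the input is valid UTF-8 (no validation is performed).
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the four-letter string αβγδ, size() and length() report 8 while count_code_points reports 4. Note that a code point is not always what a reader perceives as one character (combining accents are separate code points), so this is only a first approximation.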

Answer 2

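This works as expected on my system and on IDEOne. It switches to std::wstring and the wide streams, so substr indexes wide characters instead of bytes.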

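#include <clocale>   // std::setlocale
#include <iostream>
#include <string>

int main()
{
    // Adopt the environment's locale (e.g. a UTF-8 one) so the wide streams
    // convert between the external encoding and wchar_t.
    std::setlocale(LC_ALL, "");
    std::wcout << L"Enter greek word ";
    std::wstring wordgr;
    std::getline(std::wcin, wordgr);
    std::wcout << L"The word is " << wordgr << L"." << std::endl;
    // substr throws std::out_of_range if the start position is past the end,
    // so only ask for the third letter when there is one.
    if (wordgr.size() > 2) {
        std::wstring newwd = wordgr.substr(2, 1);
        std::wcout << L"3rd letter is " << newwd << L" <" << std::endl;
    }
    return 0;
}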
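
The wide-stream approach depends on the active locale and on how wide wchar_t is on the platform (32 bits on Linux with glibc, 16 bits on Windows). As a locale-independent alternative, one could decode the bytes into a std::u32string by hand, so that substr counts code points. The sketch below is an illustration under stated assumptions, not part of the answer above: decode_utf8 is a made-up name, and while it throws on truncated or malformed sequences, it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
// Throws std::runtime_error on truncated or malformed sequences.
// Simplification: overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

With this, decode_utf8(wordgr).substr(2, 1) holds the third code point of the word. Printing it on a UTF-8 terminal would need the reverse step (encoding back to UTF-8), which is left out here to keep the sketch short.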

Answer 3

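If you are using UTF-8 in your application, you should consider a suitable library: utf8-cpp. The plain std::string or std::wstring operations are not an option on their own, because UTF-8 characters have variable length; see the wiki for more information.

Here is sample code that demonstrates the idea.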

#include <clocale>   // std::setlocale
#include <iostream>
#include <string>
#include "source/utf8.h" // path to the utf8-cpp library header

int main()
{
    std::setlocale(LC_ALL, "");
    std::cout << "Enter greek word ";
    std::string wordgr;
    std::getline(std::cin, wordgr);
    std::cout << "The word is " << wordgr << "." << std::endl;

    // Validate first: find_invalid returns end() when the whole string is valid UTF-8.
    std::string::iterator end_it = utf8::find_invalid(wordgr.begin(), wordgr.end());
    if (end_it != wordgr.end()) {
        std::cout << "Invalid utf-8 encoding" << std::endl;
        return 1;
    }

    // utf-8 string length, in code points rather than bytes
    std::cout << "Length is " << utf8::distance(wordgr.begin(), end_it) << std::endl;

    // utf-8 string symbol traverse: utf8::next advances the iterator past the
    // current code point, so [start, it) holds exactly that code point's bytes.
    std::string::iterator it = wordgr.begin();
    while (it != wordgr.end()) {
        std::string::iterator start = it;
        utf8::next(it, wordgr.end());
        std::cout << std::string(start, it) << " - ";
    }
    std::cout << std::endl;
    return 0;
}

The output is as follows:

./a.out 
Enter greek word Вова
The word is Вова.
Length is 4
В - о - в - а -
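
For the original goal (pulling the third letter out of a word), the same traversal can be sketched without an external library. utf8_substr below is a hypothetical helper, not part of utf8-cpp or the standard library. It assumes the input is valid UTF-8 and measures positions in code points, which is not always the same as user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// True if c is a UTF-8 continuation byte (bit pattern 10xxxxxx).
static bool is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

// Hypothetical helper: like std::string::substr, but pos and count are
// measured in code points. Assumes valid UTF-8. Returns an empty string
// when pos is past the end instead of throwing.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t count)
{
    std::size_t i = 0;
    // Skip `pos` code points: step over a lead byte plus its continuation bytes.
    for (; pos > 0 && i < s.size(); --pos) {
        ++i;
        while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i]))) ++i;
    }
    std::size_t start = i;
    // Take `count` code points the same way.
    for (; count > 0 && i < s.size(); --count) {
        ++i;
        while (i < s.size() && is_continuation(static_cast<unsigned char>(s[i]))) ++i;
    }
    return s.substr(start, i - start);
}
```

Calling utf8_substr(wordgr, 2, 1) on the word Вова returns the two bytes that encode в, where the byte-based wordgr.substr(2, 1) would return only the first byte of о.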

C++ string manipulation in a UTF-8 locale

3 answers: