Question

std::string arrWords[10];
std::vector<std::string> hElemanlar;

...

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

...

What I'm doing: every element of arrWords is a std::string. I take the nth element of arrWords and push its characters, one at a time, into hElemanlar.

Suppose arrWords[0] is "test"; then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");

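My problem is that, although arrWords itself has no encoding issues, some UTF-8 characters are not printed or handled correctly once they are in hElemanlar. How can I fix this?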

Answer 1

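If you know that arrWords[i] contains UTF-8 encoded text, then you probably need to split the string into complete Unicode characters rather than into individual bytes.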
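
To see why a byte-by-byte split goes wrong, it can help to check how many bytes a character occupies. The following is only a minimal sketch, assuming the text really is UTF-8; the helper name `utf8SequenceLength` is hypothetical and does not appear in the question's code:

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`. Returns 1 for ASCII and also for bytes that are not valid
// lead bytes, so a caller stepping through a string always makes progress.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // continuation or invalid byte
}
```

For instance, ç is encoded as the two bytes 0xC3 0xA7, so pushing each byte as its own one-char string leaves two fragments that are not valid UTF-8 on their own.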

By the way, rather than saying:

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]).c_str());

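(which constructs a temporary std::string, gets its C-string representation, constructs another temporary string from that, and pushes it onto the vector), just say: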

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

Anyway, to keep multi-byte characters intact, this will need to become:

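std::string str(1, this->arrWords[sayKelime][j]);
// A first byte >= 0xC0 is the lead byte of a multi-byte UTF-8 sequence.
if (static_cast<unsigned char>(str[0]) >= 0xC0)
{
    // Append the continuation bytes (bit pattern 10xxxxxx) that follow.
    // The next byte is re-read on every pass; the loop stops at the next
    // lead byte, an ASCII char, or the terminator.
    while ((static_cast<unsigned char>(this->arrWords[sayKelime][j + 1]) & 0xC0) == 0x80)
    {
        str.push_back(this->arrWords[sayKelime][j + 1]);
        j++;
    }
}
this->hElemanlar.push_back(str);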

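Note that the loop above is safe: if j is the index of the last char in the string, [j+1] returns the null terminator, which ends the loop (std::string::operator[] guarantees this for an index equal to size() since C++11). You will need to consider how incrementing j here interacts with the rest of your code.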
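
If the interaction with j in the surrounding loop becomes awkward, the same idea can be pulled out into a standalone function. This is a rough sketch rather than a drop-in replacement: the name `splitUtf8` is hypothetical, and it assumes the input is well-formed UTF-8 (it does no validation beyond treating stray continuation bytes as single-byte elements).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per code point.
// ASSUMPTION: `text` is well-formed UTF-8. A stray continuation byte is not
// rejected; it simply becomes its own one-byte element.
std::vector<std::string> splitUtf8(const std::string& text)
{
    std::vector<std::string> result;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t end = i + 1;
        // Only a lead byte (>= 0xC0) is followed by continuation bytes.
        if (static_cast<unsigned char>(text[i]) >= 0xC0)
        {
            // Extend over the continuation bytes (bit pattern 10xxxxxx).
            while (end < text.size() &&
                   (static_cast<unsigned char>(text[end]) & 0xC0) == 0x80)
            {
                ++end;
            }
        }
        result.push_back(text.substr(i, end - i));
        i = end;
    }
    return result;
}
```

With something like this, hElemanlar could be filled by calling `splitUtf8(arrWords[sayKelime])` once, instead of advancing j by hand inside the caller's loop.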

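Then you need to decide whether each element of hElemanlar should represent a single Unicode code point (which is what the code above does), or a character plus all the combining characters that follow it. In the latter case, you have to extend the code above to: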

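Parse the next code point.
Decide whether it is a combining character.
If it is, push its UTF-8 sequence onto the string.
Repeat (a character can carry more than one combining character).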
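
The steps above might be sketched roughly as follows. All three function names are hypothetical, and the combining test is a deliberate simplification: it only recognises the Combining Diacritical Marks block (U+0300 to U+036F), whereas real text would need the Unicode character database or a library such as ICU. The input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode the code point that starts at text[i] and report its byte length.
// Minimal validation only: an invalid lead byte or a truncated sequence
// yields U+FFFD with a length of 1.
char32_t decodeAt(const std::string& text, std::size_t i, std::size_t& len)
{
    unsigned char b0 = static_cast<unsigned char>(text[i]);
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else                          { len = 1; return 0xFFFD; }
    if (i + len > text.size())    { len = 1; return 0xFFFD; }
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
    return cp;
}

// ASSUMPTION: only U+0300..U+036F counts as combining in this sketch.
bool isCombining(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Split into base characters, each followed by its combining characters.
std::vector<std::string> splitWithCombining(const std::string& text)
{
    std::vector<std::string> result;
    std::size_t i = 0;
    while (i < text.size())
    {
        std::size_t len = 0;
        char32_t cp = decodeAt(text, i, len);
        if (isCombining(cp) && !result.empty())
            result.back().append(text, i, len); // attach to the previous element
        else
            result.push_back(text.substr(i, len));
        i += len;
    }
    return result;
}
```

Here "e" followed by U+0301 (bytes 0xCC 0x81) ends up as a single three-byte element instead of two separate ones.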
