Question

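I am trying to replace the Unicode character 'NO-BREAK SPACE' (U+00C2), i.e. Â, with a space " " in a string str. In certain cases this gives me a segmentation fault. Can anyone point out where the invalid memory access is?
Am I doing this the right way?
Is there another way to do it?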

string str = "transaction applies: Â Â Â Â Â Â Â Â Â {79}";
void cleanup(string& str)
    {   
        string unicode = "\u00C2";
        size_t pos = str.find(unicode);
        while(str.find(unicode, pos)!=string::npos && pos != str.length())
        {   
            pos = str.find(unicode, pos);
            str.replace(pos, unicode.length(), " " ); //unicode replace by a space  
            // this above line is giving segmentation fault
            pos = pos + unicode.length();
        }
        return;
    }

Output:

Answer 1

Just do it like this and it will work.

char cd[]="my command \r\n";

Answer 2

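The main problem is that you replace the character with a space and only afterwards search for the next special character. Once no more matches exist, you try to replace a character that is not there, because pos == string::npos.

Here is a simple modification of the while loop that makes it work (the cout is handy for tracing):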

void cleanup(string& str) {
  string unicode = "\u00C2"; // \u00C2 is Â
  size_t pos = str.find(unicode);
  while(pos != string::npos) {
    cout << "str: " << str << "\tpos: " << pos << endl;
    str.replace(pos, unicode.length(), " ");
    // Resume right after the single space just written, so that
    // back-to-back occurrences are not skipped.
    pos = str.find(unicode, pos + 1);
  }
}

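You could (and probably should) rewrite this to loop over the string character by character, or use a do-while loop to simplify the code. But as noted above, the reason it did not work is that you attempted a replacement at a pos that does not exist.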
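The fixed loop can be generalised into a small replace-all helper. This is only a minimal sketch: the name replace_all and its parameters are hypothetical and do not come from the answer. The points it illustrates are that the loop condition tests pos against npos before every replace, and that the search resumes after the inserted text rather than after the length of the pattern.

```cpp
#include <string>

// Hypothetical helper (not from the original answer): replace every
// occurrence of `from` in `s` with `to`, in place.
void replace_all(std::string& s, const std::string& from, const std::string& to) {
    if (from.empty()) return;  // an empty pattern matches everywhere; bail out
    std::string::size_type pos = s.find(from);
    while (pos != std::string::npos) {        // only replace at a valid position
        s.replace(pos, from.length(), to);
        // Continue searching just past the text we inserted.
        pos = s.find(from, pos + to.length());
    }
}
```

Called as replace_all(str, "\xc3\x82", " "), it would turn each UTF-8 encoded Â into a single space, assuming str holds UTF-8 bytes.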

I am not suggesting this is the best program, but since the loop can be turned into a for-each loop, here is an example in C++11:

#include <locale>
#include <codecvt>
#include <iostream>
#include <string>

using namespace std;

// Using u16string because of unicode characters
void cleanup(u16string& str) {
  for(auto& c : str)
    if(c == u'\u00C2')
      c = u' ';
}

int main() {
  u16string str = u"transaction applies: \u00C2 \u00C2 \u00C2 \u00C2 \u00C2 \u00C2 \u00C2 \u00C2 \u00C2 {79}";
  cleanup(str);

  wstring_convert<codecvt_utf8<char16_t>, char16_t> convert;
  cout << convert.to_bytes(str) << endl;
}

Answer 3

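There is a confusion here between a Unicode character and its UTF-8 representation. NO-BREAK SPACE is actually the Unicode character U+00A0, whose UTF-8 representation is "\xc2\xa0". Â is LATIN CAPITAL LETTER A WITH CIRCUMFLEX, the Unicode character U+00C2, whose UTF-8 representation is "\xc3\x82".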
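One way to check which bytes a narrow string really holds is to dump them in hex. The helper below is a minimal sketch; the name to_hex is hypothetical. It treats the string as raw bytes and makes no assumption about the encoding.

```cpp
#include <string>

// Hypothetical helper: render each byte of `s` as two lowercase hex digits,
// separated by single spaces.
std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) out += ' ';
        out += digits[byte >> 4];     // high nibble
        out += digits[byte & 0x0f];   // low nibble
    }
    return out;
}
```

With it, to_hex("\xc2\xa0") gives "c2 a0" (NO-BREAK SPACE in UTF-8) and to_hex("\xc3\x82") gives "c3 82" (Â in UTF-8). Running it over the input string shows which of the two the data actually contains.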

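This means your initial string does not contain any NO-BREAK SPACE at all. If your editor's character set is Latin-1 or Windows cp1252, it contains repetitions of "\xc2\x20" (the Latin-1 encoding of "Â" followed by a space); if it is UTF-8, it contains repetitions of "\xc3\x82\x20" (the UTF-8 encoding of "Â" followed by a space). So when you search for occurrences of "\u00A0", you are actually searching for "\xc2\xa0", which never occurs in the original string. The segmentation fault then comes from pos being std::string::npos: str.find(unicode, pos) is called with a position that is not valid for the string.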
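To see what an invalid pos does in practice, here is a small probe, offered as a sketch only; the helper name replace_throws is hypothetical. As far as the standard library specification goes, std::string::find with a start position past the end simply returns npos, while std::string::replace throws std::out_of_range when pos is greater than size(). If nothing catches that exception the program is terminated, which can look like a crash.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical probe: returns true if str.replace(pos, ...) rejects `pos`
// by throwing std::out_of_range. `str` is taken by value so the caller's
// string is never modified.
bool replace_throws(std::string str, std::string::size_type pos) {
    try {
        str.replace(pos, 1, " ");
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

For example, replace_throws("abc", std::string::npos) is true, while replace_throws("abc", 0) is false.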

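What to do: pick a side. When you work with narrow strings, you have to decide which encoding you use. If you use UTF-8 (common in the Linux world), the NO-BREAK SPACE character is a string 2 chars long: { 0xc2, 0xa0 }. This line: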

string unicode = "\u00A0";

is then exactly the same as this one:

string unicode = "\xc2\xa0";

Above all, make sure you have a valid pos before you use it:

   ...
   size_t pos = str.find(unicode);
   if (pos == string::npos) return;
   ...
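Putting the two pieces of advice together (search for the UTF-8 bytes of NO-BREAK SPACE, and never use an invalid pos) could look like the sketch below. It assumes the string holds UTF-8. The name cleanup_nbsp and the returned replacement count are illustrative additions, not part of the answer.

```cpp
#include <cstddef>
#include <string>

// Replace each UTF-8 encoded NO-BREAK SPACE (bytes 0xC2 0xA0) with a plain
// space and return how many replacements were made.
// Assumption: `str` contains UTF-8 encoded text.
std::size_t cleanup_nbsp(std::string& str) {
    const std::string nbsp = "\xc2\xa0";
    std::size_t count = 0;
    std::string::size_type pos = str.find(nbsp);
    while (pos != std::string::npos) {   // stop as soon as nothing is found
        str.replace(pos, nbsp.length(), " ");
        ++count;
        pos = str.find(nbsp, pos + 1);   // continue after the inserted space
    }
    return count;
}
```

A return value of 0 means the string did not contain the searched bytes at all. That is what happens with the string from the question if it really contains Â rather than NO-BREAK SPACE.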

Segmentation fault when replacing a Unicode character with a space in a C++ string
