Question

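I am trying to detect certain combinations of Unicode characters (such as â€‹) in order to clean up strings. A single Unicode character is detected, but combinations of them are not.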

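I am using these strings to build an HTML page from another HTML page that needs cleaning. I only want to clean strings made up of such Unicode characters, which are not even visible when the HTML page is shown in a browser.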

Here is the sample code:

void detect_Unicode(string& str) { 

      if(!str.empty() && str.find_first_not_of(" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039")==string::npos)
                str.assign(" ");
      return;
 }

Input strings:

1. " â€‹    â€‹ " ;
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"  
3. " Â Â "   
4. "â€‹  Â Â â€‹" 
5. "Â Â â â"

Expected output:

1. " "  
2. "are Â Â there is something Â Â Â â€‹ combination    â€‹"   
3. " "  
4. " "  
5. " "

Please also let me know of any other approaches.

Answer 1

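OK, following on from the comments above, I think it is very likely that the input strings are UTF-8 (after all, in an HTML context, what else would they be?).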

On that basis, I humbly submit:

#include <string>
#include <codecvt>
#include <locale>

std::string narrow (const std::wstring& ws)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.to_bytes (ws);
}

std::wstring widen (const std::string& s)
{
    std::wstring_convert <std::codecvt_utf8 <wchar_t>, wchar_t> convert;
    return convert.from_bytes (s);
}

std::string detect_Unicode (const std::string& s)
{ 
    std::wstring ws = widen (s);
    if (ws.empty() || ws.find_first_not_of (L" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039") != std::wstring::npos)
        return " ";
    return s;
}

#include <iostream>

int main ()
{
    std::cout << narrow (L"\u00A0 \u00C2 \u00E2 \u20AC \u2039\n\n");
    std::cout << "0.\t\"" << detect_Unicode (u8"abcde") << "\"\n";
    std::cout << "1.\t\"" << detect_Unicode (u8" â€‹    â€‹ ") << "\"\n";
    std::cout << "2.\t\"" << detect_Unicode (u8"are Â Â there is something Â Â Â â€‹ combination    â€‹") << "\"\n";
    std::cout << "3.\t\"" << detect_Unicode (u8" Â Â ") << "\"\n";
    std::cout << "4.\t\"" << detect_Unicode (u8"â€‹  Â Â â€‹") << "\"\n";
    std::cout << "5.\t\"" << detect_Unicode (u8"Â Â â â") << "\"\n";
}

Output:

  Â â € ‹

0.  " "
1.  " â€‹    â€‹ "
2.  " "
3.  " Â Â "
4.  "â€‹  Â Â â€‹"
5.  "Â Â â â"

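Now, this is not the output the OP expected, but I think that is simply because the logic of detect_Unicode() (as opposed to the implementation) looks flawed. Note that the condition in the version above is the inverse of the one in the question: it returns " " whenever the string contains a character outside the set, which is why "abcde" is blanked while the strings made up only of listed characters come back unchanged. The main point here is that converting the input string to a wide string means you can reliably perform standard basic_string operations on it, because there are no longer any multi-byte issues.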
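To make the multi-byte issue concrete, here is a minimal sketch (not the OP's exact code). It assumes the narrow literal in the question was compiled with a UTF-8 execution character set, so each \u escape became two or three bytes; kJunkBytes and only_junk_bytes are hypothetical names used only for illustration. Because std::string::find_first_not_of compares one byte at a time, the character set degrades into a bag of bytes:

```cpp
#include <string>

// The question's character set written out as UTF-8 bytes (assumption: this is
// what the narrow literal held at run time):
// U+00A0, U+00C2, U+00E2, U+20AC, U+2039.
const std::string kJunkBytes =
    " \t\n\r\f\v"
    "\xC2\xA0" "\xC3\x82" "\xC3\xA2" "\xE2\x82\xAC" "\xE2\x80\xB9";

// Byte-wise test in the style of the question's narrow-string check:
// true if s is non-empty and every *byte* of s occurs somewhere in kJunkBytes.
bool only_junk_bytes(const std::string& s)
{
    return !s.empty() && s.find_first_not_of(kJunkBytes) == std::string::npos;
}
```

For example, only_junk_bytes("\xC2\xAC") is true even though U+00AC is not in the set, because both of its bytes occur in the encodings of other listed characters. A zero-width space ("\xE2\x80\x8B") is rejected only because its last byte happens not to occur in any of them.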

Another way to implement detect_Unicode() might be:

for (auto wide_char : ws)
{
    if (wide_char > 0xff)
        return " ";
}
return s;

But really, now that you have a wide string inside detect_Unicode, anything is possible, so go wild.

Other notes:

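std::wstring_convert and std::codecvt_utf8 (the <codecvt> header) were deprecated in C++17, but since the standard library offers no obvious replacement, you may as well use them. You can always change the implementation of narrow and widen later.
Depending on the platform, std::wstring may not be the best choice (wchar_t is 16 bits on Windows and typically 32 bits elsewhere), but it is probably fine here. You could also look at std::u16string and std::u32string.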
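As a rough illustration of swapping out the deprecated facilities, the following sketch decodes UTF-8 by hand into a std::u32string and then applies the question's test to code points instead of bytes. The names utf8_to_utf32 and only_junk_code_points are hypothetical, and the decoder is deliberately minimal: it rejects malformed lead or continuation bytes and truncated input, but it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
std::u32string utf8_to_utf32(const std::string& s)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > s.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// True if the UTF-8 string is non-empty and consists only of whitespace and
// the code points listed in the question (U+00A0, U+00C2, U+00E2, U+20AC, U+2039).
bool only_junk_code_points(const std::string& utf8)
{
    static const std::u32string junk = U" \t\n\r\f\v\u00A0\u00C2\u00E2\u20AC\u2039";
    const std::u32string cps = utf8_to_utf32(utf8);
    return !cps.empty() && cps.find_first_not_of(junk) == std::u32string::npos;
}
```

A caller could then write something like `if (only_junk_code_points(str)) str.assign(" ");`, which follows the condition used in the question. Since every element of a std::u32string is one whole code point, the set is matched character by character on every platform, regardless of the width of wchar_t.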

Live demo.

Inspired by here.

How to detect "â€‹" (a combination of Unicode characters) in a C++ string
