Question

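I would like a pattern such as ".c" in which the "." matches any UTF-8 character followed by 'c', using std::regex.

I've tried this under both Microsoft C++ and g++ and get the same result with each: the "." matches only a single byte.

Here's my test case: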

#include <stdio.h>
#include <iostream>
#include <string>
#include <regex>

using namespace std;

int main(int argc, char** argv)
{
    // make a string with 3 UTF8 characters
    const unsigned char p[] = { 'a', 0xC2, 0x80, 'c', 0 };
    string tobesearched((char*)p);

    // want to match the UTF8 character before c
    string pattern(".c");
    regex re(pattern);

    std::smatch match;
    bool r = std::regex_search(tobesearched, match, re);
    if (r)
    {
        // m.size() will be bytes, and we expect 3
        // expect 0xC2, 0x80, 'c'

        string m = match[0];
        cout << "match length " << m.size() << endl;

        // but we only get 2, we get the 0x80 and the 'c'.
        // so it's matching on single bytes and not utf8
        // code here is just to dump out the byte values.
        for (int i = 0; i < m.size(); ++i)
        {
            int c = m[i] & 0xff;
            printf("%02X ", c);
        }
        printf("\n");
    }
    else
        cout << "not matched\n";

    return 0;
}

I expect the pattern ".c" to match 3 bytes of my tobesearched string: a 2-byte UTF-8 character followed by 'c'.

Answer 1

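Some regex flavors support \X, which matches a single Unicode character that may consist of several bytes depending on the encoding. A regex engine usually reads the subject string's bytes in the encoding the engine was designed to work with, so you shouldn't have to worry about the actual encoding (whether it is US-ASCII, UTF-8, UTF-16 or UTF-32).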

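Another option is \uFFFF, where FFFF refers to the Unicode character at that index in the Unicode character set. With it you can create range matches inside a character class, e.g. [\u0000-\uFFFF]. Again, this depends on what the regex flavor supports. There is also a variant of \u, \x{...}, which does the same thing except that the Unicode character index must be supplied inside curly braces and need not be padded, e.g. \x{65}.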
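A std::regex instantiated on char treats every byte as one character, which is why the test case in the question matches only 0x80 and 'c'. One possible workaround, sketched here under stated assumptions, is to decode the UTF-8 into a std::wstring and search it with std::wregex, so that "." consumes a whole code point. The helpers decode_utf8 and matched_code_points are hypothetical names written for this illustration, not library functions; the sketch assumes well-formed UTF-8 and that wchar_t can hold each decoded code point (which does not hold for characters outside the Basic Multilingual Plane where wchar_t is 16 bits wide, as on Windows).

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not a standard library function): decodes UTF-8 into
// one wchar_t per code point. Assumes well-formed UTF-8 and that wchar_t is
// wide enough for every decoded code point.
std::wstring decode_utf8(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t n = 1;         // total bytes in this sequence
        unsigned long cp = lead;   // code point being assembled
        if (lead >= 0xF0)      { n = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { n = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { n = 2; cp = lead & 0x1F; }

        // Well-formedness is an assumption; fail loudly in debug builds
        // if the sequence is truncated.
        assert(i + n <= bytes.size());

        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);

        out.push_back(static_cast<wchar_t>(cp));
        i += n;
    }
    return out;
}

// Hypothetical helper: returns the length, in code points, of the first
// match of `pattern` in the decoded text, or 0 if nothing matches.
std::size_t matched_code_points(const std::string& utf8,
                                const std::wstring& pattern)
{
    const std::wstring text = decode_utf8(utf8);
    const std::wregex re(pattern);
    std::wsmatch match;
    if (!std::regex_search(text, match, re))
        return 0;
    return static_cast<std::size_t>(match.length(0));
}
```

With the question's bytes { 'a', 0xC2, 0x80, 'c' }, calling matched_code_points(tobesearched, L".c") should report a match of two code points (U+0080 and 'c') rather than two bytes. The trade-off is that match positions refer to the decoded wide string, not to byte offsets in the original UTF-8.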

Edit: This website is great for learning more about the various regex flavors: https://www.regular-expressions.info

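Edit 2: To match any Unicode-exclusive character (i.e. excluding the characters in the ASCII table / 1-byte characters), you can try "[\x{80}-\x{10FFFF}]" in a flavor that supports \x{...}, i.e. any character with a value from 128 to 1,114,111. That runs from the first character outside the ASCII range to U+10FFFF, the last Unicode code point, which UTF-8 currently encodes in at most 4 bytes (the encoding originally allowed up to 6, and this may change in the future).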

It would be more efficient to loop over the individual bytes:

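If the leading bit is 0, i.e. the byte's signed value is > -1, it is a 1-byte character representation. Skip to the next byte and start again.
Else if the leading bits are 11110, i.e. its signed value is > -17, n=4.
Else if the leading bits are 1110, i.e. its signed value is > -33, n=3.
Else if the leading bits are 110, i.e. its signed value is > -65, n=2.
Optionally, check that each of the following n-1 bytes starts with 10, i.e. has a signed value < -64; if any does not, the input is invalid UTF-8.
You now know that these n bytes constitute a Unicode-exclusive character. So, if the NEXT character is 'c', i.e. == 99, you can say it matched and return true.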
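The steps above can be sketched roughly as follows. This is one possible reading of them, not a definitive implementation: the function names multibyte_length and find_multibyte_then are hypothetical, and the sketch tests the lead byte with unsigned bit masks rather than the signed comparisons described, which avoids depending on whether char is signed. Like the steps, it only matches a multi-byte (non-ASCII) character followed by the given byte, so it is narrower than a regex "." would be.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length in bytes (2, 3 or 4) of the
// multi-byte UTF-8 sequence starting at s[i], or 0 if s[i] is ASCII, a stray
// continuation byte, an invalid lead byte, or the start of a truncated or
// malformed sequence.
std::size_t multibyte_length(const std::string& s, std::size_t i)
{
    assert(i < s.size());  // precondition: i indexes a byte of s
    const unsigned char lead = static_cast<unsigned char>(s[i]);

    std::size_t n = 0;
    if (lead < 0x80)                return 0;  // 0xxxxxxx: 1-byte character
    else if ((lead & 0xF8) == 0xF0) n = 4;     // 11110xxx
    else if ((lead & 0xF0) == 0xE0) n = 3;     // 1110xxxx
    else if ((lead & 0xE0) == 0xC0) n = 2;     // 110xxxxx
    else                            return 0;  // 10xxxxxx or 11111xxx

    if (i + n > s.size())
        return 0;                              // sequence is cut off

    // The optional validation step: each following byte must be 10xxxxxx.
    for (std::size_t k = 1; k < n; ++k)
    {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80)
            return 0;
    }
    return n;
}

// Hypothetical helper: looks for a multi-byte UTF-8 character immediately
// followed by `next`. On success, `pos` is the byte offset of the character
// and `len` is the byte length of the character plus `next`.
bool find_multibyte_then(const std::string& s, char next,
                         std::size_t& pos, std::size_t& len)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const std::size_t n = multibyte_length(s, i);
        if (n == 0) { ++i; continue; }         // not a multi-byte start
        if (i + n < s.size() && s[i + n] == next)
        {
            pos = i;
            len = n + 1;
            return true;
        }
        i += n;                                // skip the whole character
    }
    return false;
}
```

For the question's bytes { 'a', 0xC2, 0x80, 'c' }, find_multibyte_then(tobesearched, 'c', pos, len) should report offset 1 and a length of 3 bytes, which is the result the question was hoping to get from ".c".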

How to make std::regex match UTF-8
