Question

I am trying to find the tokens in a string that contains words, numbers, and special characters. I tried the following code:

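#include <iostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;
int main() {
    string str("The ,quick brown. fox \"99\" named quick_joe!");
    // Delimiters: runs of whitespace, commas, periods, exclamation marks, and quotes.
    regex reg("[\\s,.!\"]+");
    // Submatch -1 selects the text between the delimiter matches.
    sregex_token_iterator iter(str.begin(), str.end(), reg, -1), end;
    vector<string> vec(iter, end);
    for (const auto& a : vec) {
        cout << a << ":";
    }
    cout << endl;
}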

It produces the following output:

The:quick:brown:fox:99:named:quick_joe:

But I want this output:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:

What regular expression should I use? If possible, I would like to stick with standard C++, i.e. I would prefer a solution that does not use Boost.

(See 43594465 for the Java version of this question, but here I am looking for a C++ solution. Essentially, the question is how to map Java's Matcher and Pattern to C++.)

Answer 1

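What you are asking for is to interleave the non-matched substrings (submatch -1) with the whole matched substrings (submatch 0), which is slightly different: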

sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,0}), end;

This yields:

The: ,:quick: :brown:. :fox: ":99:" :named: :quick_joe:!:
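As a self-contained sketch of the interleaving step above, the snippet below wraps the same regex and the {-1, 0} submatch list in a function. The name split_keep_delims is a hypothetical helper introduced here for illustration; it is not part of the original answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original answer):
// returns the text between delimiter runs (submatch -1) interleaved with
// the delimiter runs themselves (submatch 0).
std::vector<std::string> split_keep_delims(const std::string& str) {
    static const std::regex reg("[\\s,.!\"]+");
    // The initializer list {-1, 0} asks the iterator to emit, for each match,
    // first the unmatched prefix and then the whole match.
    std::sregex_token_iterator iter(str.begin(), str.end(), reg, {-1, 0}), end;
    return std::vector<std::string>(iter, end);
}
```

Calling split_keep_delims on the question's string should return the same tokens as the output shown above, with the whitespace still attached to the punctuation.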

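Since you only want to drop the whitespace, change the regex to consume the surrounding whitespace and add a capture group for the non-whitespace characters. Then simply specify submatch 1 in the iterator instead of submatch 0: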

regex reg("\\s*([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;

Yields:

The:,:quick brown:.:fox:":99:":named quick_joe:!:

To split on the whitespace between adjacent words as well, you also need to split on 'just whitespace':

regex reg("\\s*\\s|([,.!\"]+)\\s*");

However, you will end up with empty submatches:

The:::,:quick::brown:.:fox:::":99:":named::quick_joe:!:

Those are easy to discard:

regex reg("\\s*\\s|([,.!\"]+)\\s*");
sregex_token_iterator iter(str.begin(), str.end(), reg, {-1,1}), end;
vector<string> vec;
// Keep only the non-empty tokens (needs <algorithm> and <iterator>).
copy_if(iter, end, back_inserter(vec), [](const string& x) { return !x.empty(); });

Finally:

The:,:quick:brown:.:fox:":99:":named:quick_joe:!:
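Putting the steps of this answer together, a minimal sketch of the splitting approach might look like the following. The function name tokenize_by_splitting is hypothetical and chosen only for this example; the regex and the {-1, 1} submatch list are the ones given above.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original answer):
// splits on whitespace and punctuation, keeping the punctuation as tokens.
std::vector<std::string> tokenize_by_splitting(const std::string& str) {
    // First alternative: plain whitespace (capture group 1 stays unmatched).
    // Second alternative: punctuation, captured in group 1, plus the
    // whitespace around it.
    static const std::regex reg("\\s*\\s|([,.!\"]+)\\s*");
    std::sregex_token_iterator iter(str.begin(), str.end(), reg, {-1, 1}), end;
    std::vector<std::string> tokens;
    // Drop the empty tokens produced by unmatched groups and empty prefixes.
    std::copy_if(iter, end, std::back_inserter(tokens),
                 [](const std::string& s) { return !s.empty(); });
    return tokens;
}
```

Note that because the capture group uses +, a run of adjacent punctuation characters such as "!!" would come back as a single token rather than one token per character.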

Answer 2

If you want to use the approach from the related Java question, you can use a matching approach here as well.

regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
sregex_token_iterator iter(str.begin(), str.end(), reg), end;
vector<string> vec(iter, end);

See the C++ demo. Result: The:,:quick:brown:.:fox:":99:":named:quick_joe:!:. Note that this does not match Unicode letters: \w (like \d and \s) is not Unicode-aware in std::regex.

Pattern details:

\d+ - one or more digits
| - or
[^\W\d]+ - one or more ASCII letters or _
| - or
[^\w\s] - any one character other than an ASCII letter/digit, _, or whitespace.
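The pattern described above can be wrapped into a small function, sketched below. The name tokenize_by_matching is hypothetical, and the sketch assumes the standard library handles class escapes such as \W inside a bracket expression as the ECMAScript grammar specifies.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original answer):
// collects every match of the token pattern instead of splitting on delimiters.
std::vector<std::string> tokenize_by_matching(const std::string& str) {
    // \d+       : a run of digits
    // [^\W\d]+  : a run of ASCII letters or underscores
    // [^\w\s]   : a single character that is neither a word character
    //             nor whitespace
    static const std::regex reg(R"(\d+|[^\W\d]+|[^\w\s])");
    // With no submatch argument the iterator yields submatch 0, i.e. each
    // whole match in order.
    std::sregex_token_iterator iter(str.begin(), str.end(), reg), end;
    return std::vector<std::string>(iter, end);
}
```

Unlike the splitting approach, this version returns each punctuation character as its own token and separates letters from digits, so "abc123" would become two tokens.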

Tokenize a C++ string using a regex with special characters
