Question

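I have a text file that I currently parse with a regular expression, and it works well. The file format is well defined: 2 numbers, separated by any whitespace, followed by an optional comment.

Now we need to add an additional (but optional) third number to this file, making the format 2 or 3 numbers separated by whitespace, with an optional comment.

I have a regex object that at least matches all the required line formats, but I am having no luck actually capturing the third (optional) number, even when it is present.

Code: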

#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;

bool regex_check(const std::string& in)
{
   std::regex check{
      "[[:space:]]*?"                    // eat leading spaces
      "([[:digit:]]+)"                   // capture 1st number
      "[[:space:]]*?"                    // eat second set of spaces
      "([[:digit:]]+)"                   // capture 2nd number
      "[[:space:]]*?"                    // eat more spaces
      "([[:digit:]]+|[[:space:]]*?)"     // optionally, capture 3rd number
      "!*?"                              // Anything after '!' is a comment
      ".*?"                              // eat rest of line
   };

   std::smatch match;

   bool result = std::regex_match(in, match, check);

   for(auto m : match)
   {
      std::cout << "  [" << m << "]\n";
   }

   return result;
}

int main()
{
   std::vector<std::string> to_check{
      "  12  3",
      "  1  2 ",
      "  12  3 !comment",
      "  1  2     !comment ",
      "\t1\t1",
      "\t  1\t  1\t !comment   \t",
      " 16653    2      1",
      " 16654    2      1 ",
      " 16654    2      1   !    comment",
      "\t16654\t\t2\t   1\t ! comment\t\t",
   };

   for(auto s : to_check)
   {
      assert(regex_check(s));
   }

   return 0;
}

This gives the following output:

  [  12  3]
  [12]
  [3]
  []
  [  1  2 ]
  [1]
  [2]
  []
  [  12  3 !comment]
  [12]
  [3]
  []
  [  1  2     !comment ]
  [1]
  [2]
  []
  [ 1   1]
  [1]
  [1]
  []
  [   1   1  !comment       ]
  [1]
  [1]
  []
  [ 16653    2      1]
  [16653]
  [2]
  []
  [ 16654    2      1 ]
  [16654]
  [2]
  []
  [ 16654    2      1   !    comment]
  [16654]
  [2]
  []
  [ 16654       2      1     ! comment      ]
  [16654]
  [2]
  []

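As you can see, it matches all of the expected input formats, but it never actually captures the third number, even when it is present.

I'm currently testing this with GCC 5.1.1, but the actual target compiler will be GCC 4.8.2, using boost::regex instead of std::regex.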

Answer 1

Let's step through the following example.

 16653    2      1
^

^ is the current match offset. At this point, we are here in the pattern:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^

(I've simplified [[:space:]] to \s and [[:digit:]] to \d for brevity.)

\s*? matches, and then (\d+) matches. We end up in the following state:

 16653    2      1
      ^

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^

Same thing: \s*? matches, and then (\d+) matches. The state is:

 16653    2      1
           ^

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^

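Now, things get trickier.

You have a \s*? here, a lazy quantifier. The engine first tries to match nothing and checks whether the rest of the pattern can match. So it tries the alternation.

The first alternative is \d+, but it fails, because there is no digit at this position.

The second alternative is \s*?, and there are no other alternatives after it. It is lazy, so the engine tries to match the empty string first.

The next token is !*?, which also matches the empty string. It is followed by .*?, which matches everything up to the end of the string (it does so because you are using regex_match; with regex_search it would have matched the empty string).

At this point, you have successfully reached the end of the pattern and you have a match, without ever being forced to match \d+ against the string.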

The problem is that this whole part of the pattern ends up being optional:

\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/

So, what can you do? You can rewrite your pattern like this:

\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?

Demo (with anchors added to mimic regex_match behavior)

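This way, you force the regex engine to consider \d instead of letting it get away with lazily matching the empty string. Lazy quantifiers are not needed, because \s and \d are disjoint.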

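!*?.*? was also suboptimal, since !*? is already covered by the .*? that follows it. I rewrote it as (?:!.*)? so that a comment must start with !; if any other text follows the numbers, the match fails.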
