Question

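I am fairly new to regular expressions and have been struggling to use one to extract the data I am after. Specifically, I want to extract the touched date and the counter from the following: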

<span style="color:blue;">&lt;query&gt;</span>
  <span style="color:blue;">&lt;pages&gt;</span>
    <span style="color:blue;">&lt;page pageid=&quot;3420&quot; ns=&quot;0&quot; title=&quot;Test&quot; touched=&quot;2011-07-08T11:00:58Z&quot; lastrevid=&quot;17889&quot; counter=&quot;9&quot; length=&quot;6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

I am currently using VS2010. My current expression is:

std::tr1::regex rx("(?:.*touch.*;)?([0-9-]+?)(?:T.*count.*;)([0-9]+)(&.*)?");
std::tr1::regex_search(buffer, match, rx);

match[1] contains the following:

    2011-07-08T11:00:58Z&quot; lastrevid=&quot;17889&quot; counter=&quot;9&quot; length=&quot;6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

match[2] contains the following:

6269&quot; /&gt;</span>
    <span style="color:blue;">&lt;/pages&gt;</span>
  <span style="color:blue;">&lt;/query&gt;</span>
<span style="color:blue;">&lt;/api&gt;</span>

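I want match[1] to contain only "2011-07-08" and match[2] to contain only "9". The date format will never change, but the counter will almost certainly get larger.

Any help would be greatly appreciated.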

Answer 1

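That is because cmatch::operator[](int i) returns a sub_match, not a string. A sub_match is a pair of iterators (first, second) into the source string, so its start pointer on its own, which is likely what you are looking at, runs from the beginning of the match to the end of the source string.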

Use sub_match::str(), i.e. match[1].str() and match[2].str().
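The difference can be sketched as follows. This is a minimal illustration, not the asker's code: it assumes std::regex from C++11 in place of std::tr1::regex (the calls used here have the same shape), and the pattern kDate and the helpers group_text and group_tail are hypothetical names made up for the example.

```cpp
#include <regex>
#include <string>

// Hypothetical pattern for this sketch: a date shaped like 2011-07-08.
static const std::regex kDate("([0-9]{4}-[0-9]{2}-[0-9]{2})");

// Returns only the characters captured by group 1, or "" if nothing matches.
std::string group_text(const char* buffer) {
    std::cmatch match;
    if (!std::regex_search(buffer, match, kDate)) return std::string();
    // str() copies exactly the range [first, second) of the sub_match.
    return match[1].str();
}

// Returns everything from the start of group 1 to the end of the buffer,
// which is what the `first` pointer shows when it is read as a C string.
std::string group_tail(const char* buffer) {
    std::cmatch match;
    if (!std::regex_search(buffer, match, kDate)) return std::string();
    return std::string(match[1].first);
}
```

For a buffer such as touched=&quot;2011-07-08T11:00:58Z&quot; counter=&quot;9&quot;, group_text yields just 2011-07-08, while group_tail yields the date followed by the rest of the buffer.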

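Also, you need a more specific expression: .* is greedy, so it first tries to match everything and only gives characters back when the rest of the pattern cannot match.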

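Try std::tr1::regex rx("touched=&quot;([0-9-]+).+counter=&quot;([0-9]+)");. The buffer holds the literal entity &quot; rather than a quote character, so the pattern spells it out.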
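A minimal sketch of that tighter pattern in use is shown below. It assumes the buffer contains the literal text &quot; (as the sample in the question suggests), that touched and counter sit on the same line, and it uses std::regex from C++11 rather than std::tr1::regex. The helper name extract_touched_and_counter is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns (date, counter), or two empty strings
// when the pattern does not match.
std::pair<std::string, std::string>
extract_touched_and_counter(const std::string& buffer) {
    // Anchor on the attribute names instead of relying on ".*".
    // "[0-9-]+" stops at the 'T' that separates the date from the time.
    static const std::regex rx(
        "touched=&quot;([0-9-]+).+counter=&quot;([0-9]+)");
    std::smatch match;
    if (!std::regex_search(buffer, match, rx)) {
        return std::pair<std::string, std::string>();
    }
    // str() returns only the captured range of each group.
    return std::make_pair(match[1].str(), match[2].str());
}
```

Because the second group is [0-9]+, a counter with more digits is captured whole without changing the pattern.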

You can also use non-greedy quantifiers (such as +? and *?) to prevent over-matching.
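The effect of a non-greedy quantifier can be seen with a small sketch. It again assumes std::regex from C++11 and a buffer containing the literal text &quot;; the helper capture_first is a hypothetical name used only for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns capture group 1 of the first match of
// `pattern` in `text`, or "" when there is no match.
std::string capture_first(const std::string& text, const std::string& pattern) {
    const std::regex rx(pattern);
    std::smatch match;
    if (!std::regex_search(text, match, rx)) return std::string();
    return match[1].str();
}
```

Given the text touched=&quot;2011-07-08T11:00:58Z&quot; counter=&quot;9&quot;, the greedy pattern touched=&quot;(.+)&quot; captures up to the last &quot; and returns 2011-07-08T11:00:58Z&quot; counter=&quot;9, whereas the non-greedy touched=&quot;(.+?)&quot; stops at the first &quot; and returns 2011-07-08T11:00:58Z.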

Answer 2

Try

std::tr1::regex rx("(?:.*touch.*;)?([0-9-]+)(?:T.*count.*;)([0-9]+)(&.*)?");

Removing the question mark makes the term greedy, so it consumes as many characters as it can.
