Question

I am reading the documentation on std::regex_iterator<std::string::iterator> because I am trying to learn how to use it to parse HTML tags. The example the site gives is

#include <iostream>
#include <string>
#include <regex>

int main ()
{
  std::string s ("this subject has a submarine as a subsequence");
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

  std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), e );
  std::regex_iterator<std::string::iterator> rend;

  while (rit!=rend) {
    std::cout << rit->str() << std::endl;
    ++rit;
  }

  return 0;
}

(http://www.cplusplus.com/reference/regex/regex_iterator/regex_iterator/)

I have one question about it: if rend is never initialized, how can it be used meaningfully in rit!=rend?

Also, which tool should I use to get the attributes out of an HTML tag? What I want to do is take a string like "class='class1 class2' id = 'myId' onclick ='myFunction()' >" and break it into pairs

("class", "class1 class2"), ("id", "myId"), ("onclick", "myFunction()")

and then work with them from there. The regular expression I plan to use is

([A-Za-z0-9\\-]+)\\s*=\\s*(['\"])(.*?)\\2

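So I plan to iterate over the matches of this expression while keeping track of whether I am still inside the tag (i.e. whether I have passed a '>' character). Will this be too hard to do?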

Thank you for any guidance you can give me.

Answer 1

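What do you mean by "if rend is never initialized"? Clearly, std::regex_iterator<I> has a default constructor. Since the iteration is forward-only, the end iterator only needs to be something suitable for detecting that the end has been reached. The default constructor can set up rend accordingly.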

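This is an idiom used in a few other places in the standard C++ library, e.g. std::istream_iterator<T>. Ideally, the end iterator could be indicated using a different type (see, e.g., Eric Niebler's discussion; the link is to the first of four pages), but the standard currently requires the two types to match when using algorithms.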
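To illustrate the idiom mentioned above, here is a minimal sketch using std::istream_iterator<int>. The helper name read_ints is hypothetical and only serves the example; the point is that the default-constructed iterator acts as the end-of-stream sentinel, just as rend does in the question.

```cpp
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration: collects every whitespace-separated
// int found in `text`.
std::vector<int> read_ints(const std::string& text) {
    std::istringstream in(text);
    std::istream_iterator<int> it(in);  // bound to the stream
    std::istream_iterator<int> end;     // default-constructed: end-of-stream sentinel
    std::vector<int> values;
    for (; it != end; ++it) {
        values.push_back(*it);
    }
    return values;
}
```

Once extraction fails or the stream is exhausted, `it` compares equal to the default-constructed `end`, so the loop stops.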

Regarding parsing HTML with regular expressions, please see this answer.
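For the narrow case in the question (a single tag body with quoted attribute values), the asker's expression can be tried out roughly as follows. This is only a sketch under those assumptions: the function name parse_attributes is hypothetical, and the code does not handle unquoted values, a '>' inside a quoted value, or HTML in general.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration: applies the regular expression from
// the question to the inside of one tag and returns (name, value) pairs.
std::vector<std::pair<std::string, std::string>>
parse_attributes(const std::string& tag_body) {
    // Group 1: attribute name, group 2: opening quote, group 3: value.
    // \2 requires the closing quote to match the opening one.
    static const std::regex attr(R"re(([A-Za-z0-9\-]+)\s*=\s*(['"])(.*?)\2)re");

    std::vector<std::pair<std::string, std::string>> attrs;
    const std::sregex_iterator end;  // default-constructed end-of-sequence iterator
    for (std::sregex_iterator it(tag_body.begin(), tag_body.end(), attr);
         it != end; ++it) {
        attrs.emplace_back((*it)[1].str(), (*it)[3].str());
    }
    return attrs;
}
```

Called on the sample string from the question, `class='class1 class2' id = 'myId' onclick ='myFunction()' >`, this should yield the three pairs the asker listed.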

Answer 2

rend is not uninitialized; it is default-constructed. The page you linked states this clearly:

The default constructor (1) constructs an end-of-sequence iterator.

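Since default construction appears to be the only way to obtain an end-of-sequence iterator, comparing rit against rend is the correct way to test whether rit is exhausted.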
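The comparison described above can be sketched as a small function. The name words_starting_with_sub is hypothetical; the regular expression is the one from the cplusplus.com example, and std::sregex_iterator is the standard alias for std::regex_iterator<std::string::const_iterator>.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration: returns every word in `s` that
// begins with "sub", using the pattern from the example in the question.
std::vector<std::string> words_starting_with_sub(const std::string& s) {
    static const std::regex e("\\b(sub)([^ ]*)");
    std::vector<std::string> out;
    const std::sregex_iterator rend;  // default-constructed: end-of-sequence
    for (std::sregex_iterator rit(s.begin(), s.end(), e); rit != rend; ++rit) {
        out.push_back(rit->str());
    }
    return out;
}
```

When no further match is found, incrementing rit turns it into an end-of-sequence iterator, which compares equal to the default-constructed rend.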

