Question

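I am trying to use a regular expression to capture all the text between every pair of ul tags in an HTML file. The pattern works fine when the content, such as li items, sits on a single line, but it fails when the text spans multiple lines. Thanks.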

int main()
{
    string fname = "test.html";
    file_to_string fts(fname);
    std::regex item_names("<ul>(.*?)</ul>");
    string s = fts.get_string();
    std::regex_token_iterator<std::string::iterator> rend;
    std::regex_token_iterator<std::string::iterator> b(s.begin(), s.end(), item_names);

    while (b != rend)
    {
        cout << "\"" << *b++ << "\" ;" << endl;
    }
    return 0;
}

Answer 1

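Your regex is correct, but you need the s flag (dot matches newline). None of the grammars that std::regex offers supports that flag, so you can tweak the pattern to use [\s\S] instead of the dot (.). That character class accepts both whitespace and non-whitespace characters, so it matches anything, including newlines.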

Sample source:

#include <regex>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string input =R"(This text is <ul>pretty long, but will be 
      concatenated into just a single string. 
       The disadvantage is that you have to quote 
      each part, and </ul>newlines must be literal as 
      usual.)";

    string regx = R"(<ul>([\s\S]*?)<\/ul>)";
    smatch matches;
    if (regex_search(input, matches, regex(regx)))
    {
        cout<<matches[1]<<"."<<endl;
    }

    return 0;
}


Answer 2

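I suggest using a pattern like this: <ul>([\s\S]*?)<\/ul>. Since HTML tag names are case-insensitive, we should also use the case-insensitive flag (i, which is std::regex::icase in C++).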

Sample code:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

int main()
{
    std::string html = "<ul><a href=\"http://stackoverflow.com\">SO</a></ul> "
                       "<ul>abc</ul>\n";
    // icase makes <UL>...</UL> match as well as <ul>...</ul>
    std::regex ul_re(R"(<ul>([\s\S]*?)<\/ul>)", std::regex::icase);
    // Submatch index 1 selects the captured text between the tags
    std::copy(std::sregex_token_iterator(html.begin(), html.end(), ul_re, 1),
              std::sregex_token_iterator(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
}

C++ regex to get all text between 2 tags, including newlines and spaces
