Question

I have the following HTML file structure:

<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2> 
         <!-- I don't want that! We're in a table.-->
      </td>
   </tr>
   <tr>...</tr> 
</table>
<h2 class="groupheader">Detailed Description</h2>
  <!-- I want all that until the next h2-->
  <div class="textblock"><p>Provides the functions to control the generation of a single data log file. </p>
    <h4>Example</h4>
    <div class="fragment"><div class="line">Test <a href="aaa">stuff</a>();</div>
        <div class="line">...</div>     
        <div class="line">...</div>
    </div>
</div> <!-- end of first result -->

<h2 class="groupheader">Member</h2>
<!-- I want all that until the next h2 or hr-->
<a class="anchor"></a>
<div class="memitem">
<div class="memproto">
      <table class="memname">
        <tr>
          <td class="memname">enum <a class="el" href="...">test</a></td>
        </tr>
      </table>
</div><div class="memdoc">
<hr><!-- End of 2nd result -->

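Using a regexp, I need to get all the content between each heading and the next heading or hr tag, except when the heading is inside a table.

So far I have a pattern that gets all h2 -> h2|hr content. It looks like this: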

(?s)(<h2 class="groupheader">.*?)(<h2|<hr)

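How can I skip the content under an H2 that is contained in a table? I have tried a negative lookbehind, but I did not get anywhere.

Thank you for your help.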

Answer 1

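Note that HTML should be processed with a proper parser.

Now, since all we have is input that looks like HTML and the task of

getting all the content between each heading and the next heading or hr tag, except when it is inside a table

let me show how that can be done.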
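As a rough illustration of why a parser is the safer tool, here is a minimal sketch using Python's standard `html.parser` module. It only collects the text of the `groupheader` headings that sit outside any table; the class name `TopLevelHeadings`, the helper `top_level_headings`, and the shortened sample markup are made up for this example and are not part of the question.

```python
from html.parser import HTMLParser


class TopLevelHeadings(HTMLParser):
    """Collect the text of groupheader <h2> elements that are not inside a <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0      # how many <table> elements are currently open
        self.in_heading = False   # True while inside a wanted <h2>
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif (tag == "h2" and self.table_depth == 0
              and ("class", "groupheader") in attrs):
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.table_depth = max(0, self.table_depth - 1)
        elif tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


def top_level_headings(html):
    parser = TopLevelHeadings()
    parser.feed(html)
    parser.close()
    return parser.headings


# Shortened, hypothetical version of the markup from the question.
SAMPLE = (
    '<table><tr><td><h2 class="groupheader">Public Types</h2></td></tr></table>'
    '<h2 class="groupheader">Detailed Description</h2><p>Text</p>'
    '<h2 class="groupheader">Member</h2><hr>'
)

print(top_level_headings(SAMPLE))  # ['Detailed Description', 'Member']
```

Tracking the table depth is what a regex cannot do reliably; the parser sees nesting directly, so the heading inside the table is ignored without any lookaround tricks.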

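You can obtain the substrings you need with the help of a tempered greedy token, ((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*), followed by a positive lookahead at the end. At each position the token first checks, via the negative lookahead, that none of </table, <h2 or <hr starts there; it then consumes either a whole inner <table>...</table> block or a single character. A heading inside a table therefore runs into </table before it reaches the next <h2 or <hr, the final lookahead fails, and that heading is skipped, while tables nested in the wanted content are still matched: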

(?s)<h2 class="groupheader">[^<]*<\/h2>\s*((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)(?=<h2|<hr)

See the demo.
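The pattern above can be tried out with a short script. This is only a sketch in Python (the question does not name a regex flavor, so Python's `re` module is an assumption), and `SAMPLE_HTML` is a shortened, hypothetical version of the markup from the question.

```python
import re

# The pattern from the answer, unchanged; (?s) lets '.' match newlines.
PATTERN = re.compile(
    r'(?s)<h2 class="groupheader">[^<]*<\/h2>\s*'
    r'((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)'
    r'(?=<h2|<hr)'
)

# Shortened stand-in for the HTML in the question (assumption).
SAMPLE_HTML = """\
<table>
   <tr class="heading">
      <td colspan="2">
         <h2 class="groupheader">Public Types</h2>
      </td>
   </tr>
</table>
<h2 class="groupheader">Detailed Description</h2>
<div class="textblock"><p>Provides the functions.</p></div>
<h2 class="groupheader">Member</h2>
<div class="memproto">
  <table class="memname"><tr><td class="memname">enum test</td></tr></table>
</div>
<hr>
"""

# Group 1 holds the content after each heading; strip surrounding whitespace.
sections = [m.group(1).strip() for m in PATTERN.finditer(SAMPLE_HTML)]

for section in sections:
    print(section)
    print("-----")
```

With this input the "Public Types" heading yields no match because `</table` is reached before any `<h2` or `<hr`, while the "Member" section keeps its inner `memname` table intact.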

Note that you can use h\d+ instead of h2 to support headings of any level.
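One possible reading of that note, sketched in Python: every `h2` in the pattern is replaced with `h\d+`, including the two lookaheads. Swapping all three occurrences is my assumption, as are the name `ANY_LEVEL` and the tiny sample string.

```python
import re

# Same structure as the answer's pattern, with h2 generalized to h\d+ (assumption).
ANY_LEVEL = re.compile(
    r'(?s)<h\d+ class="groupheader">[^<]*<\/h\d+>\s*'
    r'((?:(?!<\/table|<h\d+|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)'
    r'(?=<h\d+|<hr)'
)

# Hypothetical input: an h3 group header followed by a nested h4.
html = (
    '<h3 class="groupheader">Detailed Description</h3>'
    '<p>Intro text.</p>'
    '<h4>Example</h4>'
    '<p>Sample.</p>'
    '<hr>'
)

bodies = ANY_LEVEL.findall(html)
print(bodies)  # ['<p>Intro text.</p>']
```

Be aware of the trade-off this shows: once any heading level ends a section, a nested `<h4>` such as the "Example" heading in the question's markup also cuts the captured content short. If that is not wanted, keeping `h2` in the lookaheads and generalizing only the opening heading may be the better fit.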
