Regex - how to correctly get a nested value

Time: 2016-11-09 01:47:13

Tags: regex

I understand that parsing HTML with regular expressions is not ideal, but I have a use case for it.

I have this report / HTML page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html lang="en">

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>LCOV - .info.cleaned</title>
  <link rel="stylesheet" type="text/css" href="gcov.css">
</head>

<body>

  <table width="100%" border=0 cellspacing=0 cellpadding=0>
    <tr><td class="title">LCOV - code coverage report</td></tr>
    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>

    <tr>
      <td width="100%">
        <table cellpadding=1 border=0 width="100%">
          <tr>
            <td width="10%" class="headerItem">Current view:</td>
            <td width="35%" class="headerValue">top level</td>
            <td width="5%"></td>
            <td width="15%"></td>
            <td width="10%" class="headerCovTableHead">Hit</td>
            <td width="10%" class="headerCovTableHead">Total</td>
            <td width="15%" class="headerCovTableHead">Coverage</td>
          </tr>
          <tr>
            <td class="headerItem">Test:</td>
            <td class="headerValue">.info.cleaned</td>
            <td></td>
            <td class="headerItem">Lines:</td>
            <td class="headerCovTableEntry">399</td>
            <td class="headerCovTableEntry">1019</td>
            <td class="headerCovTableEntryLo">39.2 %</td>
          </tr>
          <tr>
            <td class="headerItem">Date:</td>
            <td class="headerValue">2016-11-07</td>
            <td></td>
            <td class="headerItem">Functions:</td>
            <td class="headerCovTableEntry">22</td>
            <td class="headerCovTableEntry">67</td>
            <td class="headerCovTableEntryLo">32.8 %</td>
          </tr>
          <tr><td><img src="glass.png" width=3 height=3 alt=""></td></tr>
        </table>
      </td>
    </tr>

    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>
  </table>

  <center>
  <table width="80%" cellpadding=1 cellspacing=1 border=0>

    <tr>
      <td width="50%"><br></td>
      <td width="10%"></td>
      <td width="10%"></td>
      <td width="10%"></td>
      <td width="10%"></td>
      <td width="10%"></td>
    </tr>

    <tr>
      <td class="tableHead">Directory <span class="tableHeadSort"><img src="glass.png" width=10 height=14 alt="Sort by name" title="Sort by name" border=0></span></td>
      <td class="tableHead" colspan=3>Line Coverage <span class="tableHeadSort"><a href="index-sort-l.html"><img src="updown.png" width=10 height=14 alt="Sort by line coverage" title="Sort by line coverage" border=0></a></span></td>
      <td class="tableHead" colspan=2>Functions <span class="tableHeadSort"><a href="index-sort-f.html"><img src="updown.png" width=10 height=14 alt="Sort by function coverage" title="Sort by function coverage" border=0></a></span></td>
    </tr>
    <tr>
      <td class="coverFile"><a href="src/index.html">src</a></td>
      <td class="coverBar" align="center">
        <table border=0 cellspacing=0 cellpadding=1><tr><td class="coverBarOutline"><img src="ruby.png" width=39 height=10 alt="39.2%"><img src="snow.png" width=61 height=10 alt="39.2%"></td></tr></table>
      </td>
      <td class="coverPerLo">39.2&nbsp;%</td>
      <td class="coverNumLo">399 / 1019</td>
      <td class="coverPerLo">32.8&nbsp;%</td>
      <td class="coverNumLo">22 / 67</td>
    </tr>
  </table>
  </center>
  <br>

  <table width="100%" border=0 cellspacing=0 cellpadding=0>
    <tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>
    <tr><td class="versionInfo">Generated by: <a href="http://ltp.sourceforge.net/coverage/lcov.php">LCOV version 1.10</a></td></tr>
  </table>
  <br>

</body>
</html>

I am trying to parse the data on this line:

    <td class="headerCovTableEntryLo">39.2 %</td>

into 39.2 (a float value).

I am currently using this regex to find the two matching TDs:

<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>

I seem to be misunderstanding how groups work. I tried:

(<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>)[0-9.].*?\1

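to take what the first group found and extract only the numeric value from it, but I get no match. Can anyone shed some light on what I am doing wrong?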

1 answer:

Answer 0: (score: 2)

Is this what you are trying to do? (capture only the float value):

<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%<\/\1>

See it working here: https://regex101.com/r/qprROm/2
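As a rough illustration of the pattern above, here is a minimal Python sketch. The `html` string is just the six summary cells copied from the report in the question, and the variable names are made up for this example; the language used in the original question is not stated, so Python is an assumption.

```python
import re

# The six summary cells from the LCOV report in the question.
html = '''
<td class="headerCovTableEntry">399</td>
<td class="headerCovTableEntry">1019</td>
<td class="headerCovTableEntryLo">39.2 %</td>
<td class="headerCovTableEntry">22</td>
<td class="headerCovTableEntry">67</td>
<td class="headerCovTableEntryLo">32.8 %</td>
'''

# Group 1 is the tag name (reused by \1 in the closing tag),
# group 2 is the numeric text we actually want.
pattern = re.compile(r'<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%<\/\1>')

coverage = [float(m.group(2)) for m in pattern.finditer(html)]
print(coverage)  # [39.2, 32.8]
```

Only the two `headerCovTableEntryLo` cells match, so the hit and total counts are skipped.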

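If so, and you are trying to reuse the first match, then \1 (or whichever number corresponds to the group you captured) is the right way to refer to it. In your attempt, however, the group captures the entire cell, class attribute included, and that text is not repeated at the closing tag, so \1 has nothing to match.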
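To make the point about \1 concrete, the sketch below (Python, chosen here only for illustration) runs the attempt from the question against a single cell and then a variant with the parentheses moved around the number. The names `cell`, `attempt`, and `fixed` are hypothetical.

```python
import re

cell = '<td class="headerCovTableEntryLo">39.2 %</td>'

# A backreference matches the same text the group captured, a second time.
repeats = re.search(r'(ab)\1', 'abab') is not None      # "ab" followed by "ab"
no_repeat = re.search(r'(ab)\1', 'abcd') is not None    # "ab" is not repeated

# The attempt from the question: group 1 is the whole cell, so \1 demands
# that the identical cell text appears again later in the input.
attempt = re.compile(
    r'(<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>)[0-9.].*?\1'
)
no_match = attempt.search(cell)

# With the parentheses around the number alone, group 1 is the value itself.
fixed = re.compile(r'<td class="headerCovTableEntryLo">([0-9.]+)\s?%<\/td>')
value = float(fixed.search(cell).group(1))

print(repeats, no_repeat, no_match, value)
```

In other words, a group is for pulling a piece out of the match, while \1 is for requiring that same piece to occur again.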

Not sure if this is what you were trying to do, haha.

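Also, in this case it does not make much sense to write <(td)>(.*?)<\/\1>. A backreference is more useful when your use case looks like <(td|th|tr)>(.*?)<\/\1>, where the opening tag can vary.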
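A small sketch of that second pattern, again in Python as an assumed language. The `row` string is invented for the example and deliberately ends with a mismatched pair (`<td>` closed by `</th>`) to show what the backreference buys you.

```python
import re

# Hypothetical input: two well-formed cells and one with a mismatched closing tag.
row = '<th>Hit</th><td>399</td><td>oops</th>'

# \1 forces the closing tag to use the same name the opening tag captured.
pattern = re.compile(r'<(td|th|tr)>(.*?)<\/\1>')

cells = pattern.findall(row)
print(cells)  # [('th', 'Hit'), ('td', '399')]
```

The last cell is skipped because `<td>` is never closed by `</td>`; with a fixed tag name such as `(td)` the backreference adds nothing over writing `td` twice.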

Finally, if I were doing this, I would rather write it like this for more flexibility: (?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)

See it here: https://regex101.com/r/qprROm/3
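The lookaround version could be used along these lines in Python (an assumed language; the `html` sample is made up, with a `th` cell added to show that the tag name no longer matters). Python's `re` module only accepts fixed-width lookbehinds, which this one is.

```python
import re

# Hypothetical sample: the same class on two different tags, with and without a space.
html = '''
<td class="headerCovTableEntryLo">39.2 %</td>
<th class="headerCovTableEntryLo">32.8%</th>
<td class="headerCovTableEntry">1019</td>
'''

# The lookbehind and lookahead assert the surrounding text without consuming it,
# so the match itself is only the number.
pattern = re.compile(r'(?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)')

values = [float(m.group(1)) for m in pattern.finditer(html)]
print(values)  # [39.2, 32.8]
```

Because the tag name and closing tag are not part of the pattern, it depends only on the class attribute and the trailing percent sign.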