I understand that parsing HTML with regular expressions is not ideal, but I have a use case for it.
I have this report / HTML page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>LCOV - .info.cleaned</title>
<link rel="stylesheet" type="text/css" href="gcov.css">
</head>
<body>
<table width="100%" border=0 cellspacing=0 cellpadding=0>
<tr><td class="title">LCOV - code coverage report</td></tr>
<tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>
<tr>
<td width="100%">
<table cellpadding=1 border=0 width="100%">
<tr>
<td width="10%" class="headerItem">Current view:</td>
<td width="35%" class="headerValue">top level</td>
<td width="5%"></td>
<td width="15%"></td>
<td width="10%" class="headerCovTableHead">Hit</td>
<td width="10%" class="headerCovTableHead">Total</td>
<td width="15%" class="headerCovTableHead">Coverage</td>
</tr>
<tr>
<td class="headerItem">Test:</td>
<td class="headerValue">.info.cleaned</td>
<td></td>
<td class="headerItem">Lines:</td>
<td class="headerCovTableEntry">399</td>
<td class="headerCovTableEntry">1019</td>
<td class="headerCovTableEntryLo">39.2 %</td>
</tr>
<tr>
<td class="headerItem">Date:</td>
<td class="headerValue">2016-11-07</td>
<td></td>
<td class="headerItem">Functions:</td>
<td class="headerCovTableEntry">22</td>
<td class="headerCovTableEntry">67</td>
<td class="headerCovTableEntryLo">32.8 %</td>
</tr>
<tr><td><img src="glass.png" width=3 height=3 alt=""></td></tr>
</table>
</td>
</tr>
<tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>
</table>
<center>
<table width="80%" cellpadding=1 cellspacing=1 border=0>
<tr>
<td width="50%"><br></td>
<td width="10%"></td>
<td width="10%"></td>
<td width="10%"></td>
<td width="10%"></td>
<td width="10%"></td>
</tr>
<tr>
<td class="tableHead">Directory <span class="tableHeadSort"><img src="glass.png" width=10 height=14 alt="Sort by name" title="Sort by name" border=0></span></td>
<td class="tableHead" colspan=3>Line Coverage <span class="tableHeadSort"><a href="index-sort-l.html"><img src="updown.png" width=10 height=14 alt="Sort by line coverage" title="Sort by line coverage" border=0></a></span></td>
<td class="tableHead" colspan=2>Functions <span class="tableHeadSort"><a href="index-sort-f.html"><img src="updown.png" width=10 height=14 alt="Sort by function coverage" title="Sort by function coverage" border=0></a></span></td>
</tr>
<tr>
<td class="coverFile"><a href="src/index.html">src</a></td>
<td class="coverBar" align="center">
<table border=0 cellspacing=0 cellpadding=1><tr><td class="coverBarOutline"><img src="ruby.png" width=39 height=10 alt="39.2%"><img src="snow.png" width=61 height=10 alt="39.2%"></td></tr></table>
</td>
<td class="coverPerLo">39.2 %</td>
<td class="coverNumLo">399 / 1019</td>
<td class="coverPerLo">32.8 %</td>
<td class="coverNumLo">22 / 67</td>
</tr>
</table>
</center>
<br>
<table width="100%" border=0 cellspacing=0 cellpadding=0>
<tr><td class="ruler"><img src="glass.png" width=3 height=3 alt=""></td></tr>
<tr><td class="versionInfo">Generated by: <a href="http://ltp.sourceforge.net/coverage/lcov.php">LCOV version 1.10</a></td></tr>
</table>
<br>
</body>
</html>
I am trying to parse the data from this line:
<td class="headerCovTableEntryLo">39.2 %</td>
into 39.2 (a float value).
I am currently using this regex to find the two matching TDs:
<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>
I think I am misunderstanding how groups work. I tried:
(<td class="headerCovTableEntryLo">[0-9.].*?.%<\/td>)[0-9.].*?\1
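to take what the first group found and extract only the numeric value, but I get no matches. Can anyone shed some light on what I am doing wrong?
Answer 0: (score: 2)
Is this what you are trying to do? (It captures only the float value):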
<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%<\/\1>
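See it working here: https://regex101.com/r/qprROm/2
If so, \1 is what you use when you want to reuse the first match: a backreference matches the same text that the referenced group captured. In your attempt, though, group 1 also captured the class attribute, and that text does not appear again at the closing tag.
Not sure if this is what you were trying to do, haha.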
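To make the point about groups concrete, here is a minimal sketch. Python is my assumption, since the question does not say which regex engine is in use, and the `html` string is just a trimmed copy of the header cells from the report. It shows the pattern above pulling out both floats, and the pattern from the question finding nothing because \1 demands that the whole first cell appear a second time.

```python
import re

# Trimmed copy of the header cells from the LCOV report in the question.
html = '''<td class="headerCovTableEntry">399</td>
<td class="headerCovTableEntry">1019</td>
<td class="headerCovTableEntryLo">39.2 %</td>
<td class="headerCovTableEntry">22</td>
<td class="headerCovTableEntry">67</td>
<td class="headerCovTableEntryLo">32.8 %</td>'''

# Group 1 captures the tag name, group 2 the number; \1 re-matches "td".
pattern = re.compile(r'<(td) class="headerCovTableEntryLo">([0-9.]+)\s?%</\1>')
values = [float(m.group(2)) for m in pattern.finditer(html)]
print(values)

# The attempt from the question: \1 must repeat the entire first <td>...</td>
# text verbatim, which never happens here, so there is no match.
attempt = re.compile(
    r'(<td class="headerCovTableEntryLo">[0-9.].*?.%</td>)[0-9.].*?\1'
)
print(attempt.search(html))
```

In other words, a group is read back with `m.group(n)` after the match; \1 inside the pattern is only for requiring the same text again.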
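Also, in this case writing <(td)>(.*?)<\/\1> does not make much sense, because the group can only ever capture td. A backreference is more useful when your use case looks like this: <(td|th|tr)>(.*?)<\/\1>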
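A small sketch of that alternation case, again assuming Python and using made-up one-line inputs rather than the real report: the backreference forces the closing tag to be the same one that was opened.

```python
import re

# \1 must repeat whichever of td / th / tr was matched by group 1.
tag_pair = re.compile(r'<(td|th|tr)>(.*?)</\1>')

# Two well-formed cells: each closing tag matches its opening tag.
cells = tag_pair.findall('<td>a</td><th>b</th>')
print(cells)

# Opening <td> closed by </th>: the backreference rejects it.
mismatched = tag_pair.search('<td>x</th>')
print(mismatched)
```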
Finally, if I were doing this, I would rather write it this way for more flexibility: (?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)
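Here is a rough sketch of the lookaround version, assuming Python (whose `re` module needs a fixed-width lookbehind, which this one is) and a shortened sample of the two coverage cells. Because the surrounding markup sits in the lookbehind and lookahead, the match itself is just the number.

```python
import re

# Shortened sample: only the two "Lo" coverage cells from the report.
html = (
    '<td class="headerCovTableEntryLo">39.2 %</td>\n'
    '<td class="headerCovTableEntryLo">32.8 %</td>'
)

# The lookbehind anchors on the class attribute and the lookahead on the
# percent sign; neither is consumed, so only the digits are captured.
lookaround = re.compile(r'(?<=class="headerCovTableEntryLo">)([0-9.]+)(?=\s?%)')
coverage = [float(v) for v in lookaround.findall(html)]
print(coverage)
```

The first value is the line coverage and the second the function coverage, in the order they appear in the report.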