Question

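I'm not sure how to title this question accurately, so I'm open to suggestions. Clearly, something is wrong with my regular expression.

I'm using the .NET 4.6.2 Regex class with these options: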

RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline

The input looks like this:

<!--malformed HTML beyond my control-->
<table summary="Profile Information" width="100%">
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Profile</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Name: </td>
        <td align="right">Bob Smith</td>
    </tr>
    <tr>
        <td height="5" colspan="2" scope="row"></td>
    </tr>
    <tr>
        <td colspan="2" scope="row"><font size="4"><b>Personal Information</b></font></td>
    </tr>
    <tr>
        <td valign="top" scope="row">Position: </td>
        <td valign="bottom" align="right">IT Director</td>
    </tr>
    <tr>
        <td valign="top" scope="row">Address: </td>
        <td valign="bottom" align="right">1234 Main St
                    Austin, TX
        </td>
    </tr>
</table>
<!--malformed HTML beyond my control-->

My regular expression is:

<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>

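I want it to match table rows that define two cells and skip rows that define only one. Specifically, I want it to capture the attribute name (i.e. Name:, Position:, Address:) and the value associated with it.

Instead, I get the following captures: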

Match 1
Matched string: <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row">Profile</td> </tr> <tr> <td valign="top" scope="row">Name: </td> <td align="right">Bob Smith</td> </tr>
$1: </td> </tr> <tr> <td colspan="2" scope="row">Profile</td> </tr> <tr> <td valign="top" scope="row">Name:
$2: Bob Smith

Match 2
Matched string: <tr> <td height="5" colspan="2" scope="row"></td> </tr> <tr> <td colspan="2" scope="row">Personal Information</td> </tr> <tr> <td valign="top" scope="row">Position: </td> <td valign="bottom" align="right">IT Director</td> </tr>
$1: </td> </tr> <tr> <td colspan="2" scope="row">Personal Information</td> </tr> <tr> <td valign="top" scope="row">Position:
$2: IT Director

Match 3
Matched string: <tr> <td valign="top" scope="row">Address: </td> <td valign="bottom" align="right">1234 Main St Austin, TX </td> </tr>
$1: Address:
$2: 1234 Main St Austin, TX

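I apologize for not being able to present the results in a cleaner format; a table would not display the problem properly.

What I think might be going wrong

It seems to me that one of my dot matchers is matching more than I want it to. I made them non-greedy (.*?), so I'm a bit confused about why they appear to match beyond the first closing tag they encounter.

As far as I can tell, this should never be part of any match: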

<tr>
<td height="5" colspan="2" scope="row"></td>
</tr>

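Yet it appears in the first matched string.

What am I missing? How should I achieve this?

If any other information is needed for this question, please let me know.

P.S. I've been using http://regexstorm.net/tester to experiment with and debug the issue.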

Answer 1

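Non-greedy quantifiers do not change the leftmost-match behavior. The engine still commits to the earliest starting position at which any match exists: if a greedy match exists at a given position, a non-greedy match exists at that position too, and a lazy quantifier will expand as far as it must for the rest of the pattern to succeed. You can hack around it by not letting anything match across a tag boundary: [^>]* keeps the attribute part inside its tag (with Singleline, .*? there can still run past the closing > and into later rows), and a negative lookahead keeps the captured content from containing </td>:

<tr>\s*<td[^>]*>((?:(?!</td>).)*?)</td>\s*<td[^>]*>((?:(?!</td>).)*?)</td>\s*</tr>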
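To make the overshoot concrete, here is a minimal sketch that runs the question's pattern and a stricter variant against a two-row sample. The class name, the helper FirstGroup1, and the shortened sample HTML are made up for illustration. The stricter variant assumes [^>]* for the attribute part of each tag, because with Singleline a lazy .*? inside <td.*?> can still expand past the closing > and into a later row.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyOvershootDemo
{
    const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Pattern from the question: every dot is lazy, but nothing stops it crossing tags.
    public const string Original =
        @"<tr>\s*<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>\s*</tr>";

    // Stricter variant (an assumption of this sketch): [^>]* stays inside one tag,
    // and (?:(?!</td>).)*? refuses to consume a </td>.
    public const string Strict =
        @"<tr>\s*<td[^>]*>((?:(?!</td>).)*?)</td>\s*<td[^>]*>((?:(?!</td>).)*?)</td>\s*</tr>";

    // Hypothetical helper: returns group 1 of the first match, or null if there is none.
    public static string FirstGroup1(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, Options);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // A one-cell row followed by a two-cell row, as in the question's table.
        string html =
            "<tr><td colspan=\"2\">Profile</td></tr>" +
            "<tr><td>Name: </td><td>Bob Smith</td></tr>";

        // The match starts at the first <tr>, so group 1 swallows the whole first row.
        Console.WriteLine("[" + FirstGroup1(Original, html) + "]");
        // prints: [Profile</td></tr><tr><td>Name: ]

        // No match can start at the first <tr>, so the engine moves on to the second row.
        Console.WriteLine("[" + FirstGroup1(Strict, html) + "]");
        // prints: [Name: ]
    }
}
```

The first result shows that laziness only controls how far a quantifier expands once a starting position is fixed; it does not make the engine skip a starting position from which some match is reachable.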

But I would actually do it in two steps, first matching each row:

<tr>(.*?)</tr>

Then, within each row's content, check the rest with a simpler expression.

Answer 2

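Using .*? instead of .* makes the quantifiers lazy, but the pattern in the question already does that, so it is not enough on its own. Matching each row first and then the cells inside it keeps a lazy quantifier from running across rows.

Try this: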

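// Requires: using System.Text.RegularExpressions;
// readText is assumed to hold the HTML input shown in the question.
RegexOptions options = RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline;

// Step 1: isolate each row, so nothing in step 2 can run past a </tr>.
Regex rowRegex = new Regex(@"<tr>(.+?)</tr>", options);
// Step 2: inside a single row, look for two adjacent cells.
Regex cellRegex = new Regex(@"<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>", options);

foreach (Match rowMatch in rowRegex.Matches(readText))
{
    // Groups[1] is the row content between <tr> and </tr>.
    Match cellMatch = cellRegex.Match(rowMatch.Groups[1].Value);

    // Rows with a single cell do not match and are skipped.
    if (cellMatch.Success)
    {
        string inner1 = cellMatch.Groups[1].Value; // attribute name, e.g. "Name: "
        string inner2 = cellMatch.Groups[2].Value; // associated value, e.g. "Bob Smith"
    }
}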
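As a usage sketch, the same two-step idea can be wrapped in a method that returns the name/value pairs, which makes it easy to try against sample input. The class name TwoStepExtract, the method Extract, the Trim calls, and the shortened sample table are assumptions added for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStepExtract
{
    const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Step 1: one match per table row.
    static readonly Regex RowRegex = new Regex(@"<tr>(.+?)</tr>", Options);
    // Step 2: two adjacent cells inside a single row.
    static readonly Regex CellRegex =
        new Regex(@"<td.*?>(.*?)</td>\s*<td.*?>(.*?)</td>", Options);

    // Hypothetical helper: returns (name, value) pairs for rows that have two cells.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match row in RowRegex.Matches(html))
        {
            Match cells = CellRegex.Match(row.Groups[1].Value);
            if (cells.Success)
            {
                // Trim is an addition here, to drop the whitespace around cell text.
                pairs.Add(new KeyValuePair<string, string>(
                    cells.Groups[1].Value.Trim(),
                    cells.Groups[2].Value.Trim()));
            }
        }
        return pairs;
    }

    public static void Main()
    {
        // Shortened sample modeled on the question's table.
        string html = @"<table>
  <tr><td colspan=""2"" scope=""row""><b>Profile</b></td></tr>
  <tr><td valign=""top"" scope=""row"">Name: </td><td align=""right"">Bob Smith</td></tr>
  <tr><td valign=""top"" scope=""row"">Position: </td><td align=""right"">IT Director</td></tr>
</table>";

        foreach (var pair in Extract(html))
            Console.WriteLine(pair.Key + " " + pair.Value);
        // prints:
        // Name: Bob Smith
        // Position: IT Director
    }
}
```

Because the cell pattern only ever sees one row at a time, a lazy quantifier has nowhere to overshoot to, which is why the simpler <td.*?> form is tolerable in this version.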
