Question

I have an HTML page that contains a single <table> element but many <tr> and <td> elements.

Example:

<tr attributes >
    <td>Name1</td>
    <td>some text</td>
    <td>some text</td>
</tr>                                                            1.
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1989</td>
    <td>some text</td>
</tr>
------------------------------------------------------------------------------
<tr attributes >
    <td>Name2</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>                                            
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1979</td>
    <td>some text</td>
</tr>
------------------------------------------------------------------------------
<tr attributes >
    <td>Name3</td>
    <td>some text</td>
    <td>some text</td>
</tr>                                                                  2.
<tr>
    <td>some text</td>
    <td>--------</td>
    <td>some text</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1089</td>
    <td>some text</td>
</tr>

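Now suppose I want the rows between NAME1 and the TOTAL that follows it, and the rows between NAME3 and the TOTAL that follows it.

There can be any number of rows and columns in between.

The number of rows and columns is not fixed.

So the output should include the groups marked 1. and 2.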

Answer 1

If you want capture groups that separate the text from the HTML, use this:

<td>Name(1|3)</td>((\s*<td>([^<]+)</td>\s*)+</tr>(.*?)<tr>)+?\s*<td>Total</td>

You have to add the "s" option (dot-all mode), so that "." also matches line breaks.
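
As a rough illustration of the dot-all requirement, here is a minimal Python sketch. It assumes Python's `re` module, where the "s" option is spelled `re.DOTALL`; the sample markup and the helper name `find_blocks` are made up for this example and are not part of the answer.

```python
import re

# The regex from this answer, compiled with dot-all so "." can cross newlines.
PATTERN = re.compile(
    r"<td>Name(1|3)</td>((\s*<td>([^<]+)</td>\s*)+</tr>(.*?)<tr>)+?\s*<td>Total</td>",
    re.DOTALL,
)
# Same pattern text without the flag, for comparison only.
SAME_WITHOUT_FLAG = re.compile(PATTERN.pattern)

# Hypothetical sample with three groups, shaped like the markup in the question.
SAMPLE = """\
<tr class="head">
    <td>Name1</td>
    <td>some text</td>
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
</tr>
<tr>
    <td>Total</td>
    <td>1989</td>
</tr>
<tr class="head">
    <td>Name2</td>
    <td>some text</td>
</tr>
<tr>
    <td>Total</td>
    <td>1979</td>
</tr>
<tr class="head">
    <td>Name3</td>
    <td>some text</td>
</tr>
<tr>
    <td>some text</td>
    <td>--------</td>
</tr>
<tr>
    <td>Total</td>
    <td>1089</td>
</tr>
"""


def find_blocks(html, pattern=PATTERN):
    """Return (name digit, matched text) for every NameN ... Total span."""
    return [(m.group(1), m.group(0)) for m in pattern.finditer(html)]


if __name__ == "__main__":
    for digit, text in find_blocks(SAMPLE):
        print("Name" + digit, "->", len(text), "characters matched")
    # Without dot-all, ".*?" cannot cross the newline between </tr> and <tr>.
    print(find_blocks(SAMPLE, SAME_WITHOUT_FLAG))
```

Note that the groups nested inside the repeated part only keep the values from their last repetition, so group 0 (the whole match) is usually the more useful result here.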

Answer 2

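I agree with the others when they say you should use a parser. That solution is more robust than a regex. However, if you know that the HTML the regex runs against will not change much, the regex approach can work fine. Be aware that even a small change to the HTML can cause this solution to fail later. For example, if an attribute is added to any of the inner rows, this regex will no longer find a match. A regex can be written to handle that case as well, but it becomes more complex and harder to read.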
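
For comparison, the parser route could look roughly like the following Python sketch. It uses the standard-library `html.parser` module; the class `RowCollector`, the function `rows_between`, and the sample markup are hypothetical names chosen for this illustration, and the sketch assumes that only the rows strictly between the NameN row and the Total row are wanted, as lists of cell text.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect every <tr> as a list of the stripped text of its <td> cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_between(html, names, end_marker="Total"):
    """Map each wanted name to the rows between its row and the next end_marker row."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    groups = {}
    current = None
    for row in parser.rows:
        first = row[0] if row else ""
        if current is None:
            if first in names:
                current = first
                groups[current] = []
        elif first == end_marker:
            current = None
        else:
            groups[current].append(row)
    return groups


# Hypothetical sample; note the attribute on an inner row.
SAMPLE = """
<table>
<tr class="head"><td>Name1</td><td>a</td></tr>
<tr class="inner"><td>x</td><td>--------</td><td>y</td></tr>
<tr><td>Total</td><td>--------</td><td>1989</td></tr>
<tr class="head"><td>Name2</td><td>a</td></tr>
<tr><td>p</td><td>q</td></tr>
<tr><td>Total</td><td>--------</td><td>1979</td></tr>
<tr class="head"><td>Name3</td><td>a</td></tr>
<tr><td>m</td><td>n</td></tr>
<tr><td>o</td></tr>
<tr><td>Total</td><td>--------</td><td>1089</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_between(SAMPLE, {"Name1", "Name3"}))
```

Because the rows are matched on tag names rather than on literal text, attributes on any row and a varying number of cells do not affect the result.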

This regex works for the sample HTML you provided in the question. Use capture group 1 to get only the inner rows:

<tr\s+[^>]+>\s*<td>Name(?:1|3)</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>((?:\s*<tr>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>)+?)\s*<tr>\s*<td>Total</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>
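
To make the use of capture group 1 concrete, and to show the fragility mentioned earlier, here is a small Python sketch. It assumes Python's `re` module; the helper `inner_rows` and the sample strings are invented for this example.

```python
import re

# The regex from this answer, split across string literals for readability only.
ROW_GROUP = re.compile(
    r"<tr\s+[^>]+>\s*<td>Name(?:1|3)</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
    r"((?:\s*<tr>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>)+?)"
    r"\s*<tr>\s*<td>Total</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
)


def inner_rows(html):
    """Return capture group 1 (the rows between NameN and Total) for each match."""
    return [m.group(1).strip() for m in ROW_GROUP.finditer(html)]


# Hypothetical sample shaped like the markup in the question.
SAMPLE = """\
<tr class="head">
    <td>Name1</td>
    <td>some text</td>
</tr>
<tr>
    <td>inner row</td>
    <td>--------</td>
</tr>
<tr>
    <td>Total</td>
    <td>--------</td>
    <td>1989</td>
</tr>
"""

# The same markup with an attribute added to the inner row.
CHANGED = SAMPLE.replace("<tr>\n    <td>inner row", '<tr id="x">\n    <td>inner row')

if __name__ == "__main__":
    print(inner_rows(SAMPLE))   # one entry holding the inner <tr>...</tr>
    print(inner_rows(CHANGED))  # empty: the literal "<tr>" no longer matches
```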

Here is a rough breakdown of the regex:

#Match the first row.
<tr\s+[^>]+>                    #Match the opening TR tag, allow for any attributes found inside the tag.
\s*<td>Name(?:1|3)</td>         #Match the first cell. Only allow its contents to be "Name1" or "Name3".
(?:\s*<td>[\w\s-]+</td>)+       #Match all other cells in this row.
\s*</tr>                        #Match the end of the row.

#Match all rows between the first and last row (capture group 1).
(
    (?:
        \s*<tr>                         #Match the beginning of an inner row.
            (?:\s*<td>[\w\s-]+</td>)+   #Match all the cells in the current row.
        \s*</tr>                        #Match the end of the current row.
    )+?
)

#Match the last row.
\s*<tr>                         #Match the beginning of the last row.
\s*<td>Total</td>               #Match the first cell. Only allow its contents to be "Total".
(?:\s*<td>[\w\s-]+</td>)+       #Match all other cells in this row.
\s*</tr>                        #Match the end of the last row.
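
A commented layout like this one can be compiled directly in engines that support a free-spacing mode. The following is a minimal Python sketch, assuming `re.VERBOSE`; it wraps the inner rows in capture group 1 to agree with the one-line regex, and the names `VERBOSE_PATTERN`, `COMPACT_PATTERN`, and `SAMPLE` are made up for the example.

```python
import re

# Free-spacing version: whitespace and "#" comments in the pattern are ignored.
VERBOSE_PATTERN = re.compile(
    r"""
    # First row: a TR with attributes whose first cell is Name1 or Name3.
    <tr\s+[^>]+>
    \s*<td>Name(?:1|3)</td>
    (?:\s*<td>[\w\s-]+</td>)+
    \s*</tr>

    # Inner rows, captured as group 1.
    (
        (?:
            \s*<tr>
                (?:\s*<td>[\w\s-]+</td>)+
            \s*</tr>
        )+?
    )

    # Last row: first cell must be Total.
    \s*<tr>
    \s*<td>Total</td>
    (?:\s*<td>[\w\s-]+</td>)+
    \s*</tr>
    """,
    re.VERBOSE,
)

# The one-line regex from this answer, for comparison.
COMPACT_PATTERN = re.compile(
    r"<tr\s+[^>]+>\s*<td>Name(?:1|3)</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
    r"((?:\s*<tr>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>)+?)"
    r"\s*<tr>\s*<td>Total</td>(?:\s*<td>[\w\s-]+</td>)+\s*</tr>"
)

# Hypothetical single-line sample.
SAMPLE = (
    '<tr class="h"><td>Name3</td><td>a</td></tr>'
    "<tr><td>b</td><td>c</td></tr>"
    "<tr><td>Total</td><td>1089</td></tr>"
)

if __name__ == "__main__":
    match = VERBOSE_PATTERN.search(SAMPLE)
    print(match.group(1))  # prints <tr><td>b</td><td>c</td></tr>
```

Keeping the verbose form in source code makes later adjustments, such as allowing attributes on inner rows, easier to review than edits to the one-line version.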

How to get a string between rows of an HTML page that starts with a certain word and ends with a certain word
