Extracting text between HTML tags

Time: 2016-12-21 01:06:28

Tags: html perl parsing

I am trying to extract the text between all <tr> </tr> tags in a string and print it.

use strict;
use warnings;

my $HTML = '
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">11111</td>
<td class="cl_2">11111</td>
<td class="cl_3"><strong>11111</strong></td>
<td class="cl_4" colspan="3">11111</td>
</tr>
<tr data_1="16,12,2016" data_2="0">
<td class="cl_1">22222</td>
<td class="cl_2">22222</td>
<td class="cl_3"><strong>22222</strong></td>
<td class="cl_4" colspan="3">22222</td>
</tr>
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">33333</td>
<td class="cl_2">33333</td>
<td class="cl_3"><strong>33333</strong></td>
<td class="cl_4" colspan="3">33333</td>
</tr>
';

while($HTML =~ /data_2="1">(.*)<\/tr>(\R)/sg) {
    print "$1\n\n";
}

The output should be:

<td class="cl_1">11111</td>
<td class="cl_2">11111</td>
<td class="cl_3"><strong>11111</strong></td>
<td class="cl_4" colspan="3">11111</td>

<td class="cl_1">33333</td>
<td class="cl_2">33333</td>
<td class="cl_3"><strong>33333</strong></td>
<td class="cl_4" colspan="3">33333</td>

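How can I do this and extract the content from each <tr> tag?

1 answer:

Answer 0 (score: 0)

Edited the answer to take into account the new restriction that the match must be tied to the <tr> tags.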

my $HTML = '
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">11111</td>
<td class="cl_2">11111</td>
<td class="cl_3"><strong>11111</strong></td>
<td class="cl_4" colspan="3">11111</td>
</tr>
<tr data_1="16,12,2016" data_2="0">
<td class="cl_1">22222</td>
<td class="cl_2">22222</td>
<td class="cl_3"><strong>22222</strong></td>
<td class="cl_4" colspan="3">22222</td>
</tr>
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">33333</td>
<td class="cl_2">33333</td>
<td class="cl_3"><strong>33333</strong></td>
<td class="cl_4" colspan="3">33333</td>
</tr>
';

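# Match only <tr> tags carrying data_2="1". The non-greedy (.*?) stops at
# the nearest </tr>, and /s lets . match the newlines between the cells.
while ($HTML =~ /<tr[^>]*data_2="1"[^>]*>(.*?)<\/tr>/sg) {
    print "$1\n\n";
}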

Output:

<td class="cl_1">11111</td>
<td class="cl_2">11111</td>
<td class="cl_3"><strong>11111</strong></td>
<td class="cl_4" colspan="3">11111</td>

<td class="cl_1">33333</td>
<td class="cl_2">33333</td>
<td class="cl_3"><strong>33333</strong></td>
<td class="cl_4" colspan="3">33333</td>
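
The key difference from the pattern in the question is greediness. With /s, a greedy (.*) runs on to the last </tr> in the string, so the original loop produces one large match instead of one match per row. As a minimal sketch of the same idea packaged for reuse, the helper below collects the matching rows into a list instead of printing them. The name extract_rows, its parameters, and the shortened sample data are assumptions made for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns the inner
# content of every <tr> whose data_2 attribute equals $wanted.
sub extract_rows {
    my ($html, $wanted) = @_;
    my @rows;
    # \Q...\E quotes $wanted so it is matched literally.
    # (.*?) is non-greedy, so each match ends at the nearest </tr>.
    while ($html =~ m{<tr\b[^>]*\bdata_2="\Q$wanted\E"[^>]*>(.*?)</tr>}sg) {
        my $inner = $1;
        $inner =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace
        push @rows, $inner;
    }
    return @rows;
}

# Shortened sample data, assumed for this sketch.
my $html = <<'HTML';
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">11111</td>
</tr>
<tr data_1="16,12,2016" data_2="0">
<td class="cl_1">22222</td>
</tr>
<tr data_1="15,12,2016" data_2="1">
<td class="cl_1">33333</td>
</tr>
HTML

print "$_\n\n" for extract_rows($html, '1');
```

This only holds up while the markup stays as regular as the sample: no nested tables, and attribute values always in double quotes. For less predictable HTML, a real parser module such as HTML::TreeBuilder from CPAN is usually the safer choice.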