Global-match regular expression hangs

Date: 2016-02-26 05:23:06

Tags: regex perl hang

I have the following Perl code:

# $content is the text of a webpage
while ($content =~ /rgRow.*?<td>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>(.*?)<\/td><td.*?>.*?<\/td><td.*?>(.*?)<\/td><td.*?><nobr>(.*?)<\/nobr><\/td>/sg) {
   # do stuff
}

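I have determined that the code hangs on this regex call. It gets through about 2-3 iterations of the while loop and then hangs. I left it running for about 30 minutes and it made no further progress.

What could be the problem?

The purpose of the code is to go through some HTML and extract some data from it.

Here is the HTML that I am setting $content to:
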
<tbody>
        <tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__0">
            <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : SECOND PERIODIC REPORT OF STATES PARTIES DUE IN 1974 / MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl04_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.65%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.65/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__1">
            <td>CONSIDERATION OF REPORTS SUBMITTED BY STATES PARTIES UNDER ARTICLE 9 OF THE CONVENTION : INITIAL REPORTS OF STATES PARTIES WHICH ARE DUE IN 1972 / MOROCCO</td><td>State party's report</td><td>CERD</td><td>Morocco</td><td>CERD/C/R.33/Add.1</td><td><nobr>17 Jan 1972</nobr></td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl06_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=CERD%2fC%2fR.33%2fAdd.1&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">CERD/C/R.33/Add.1</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__2">
            <td>Annex I to ALGERIA's Report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl08_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13691&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13691_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13691</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__3">
            <td>Annex II to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl10_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13692&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13692_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13692</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__4">
            <td>Annex III to ALGERIA's report</td><td>Annex to State party report</td><td>CERD</td><td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl12_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fAIS%2fDZA%2f13693&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_AIS_DZA_13693_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/AIS/DZA/13693</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__5">
            <td>CERD-C-NZ-18-20_Annexes</td><td>Annex to State party report</td><td>CERD</td><td>New Zealand</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl14_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fNZL%2f13731&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_NZL_13731_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/NZL/13731</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__6">
            <td>CERD.C.RUS.20-22_Annex1</td><td>Annex to State party report</td><td>CERD</td><td>Russian Federation</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl16_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fRUS%2f13732&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">R</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_RUS_13732_R.doc</td><td style="display:none;">INT/CERD/ADR/RUS/13732</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__7">
            <td>Annex to State party report</td><td>Annex to State party report</td><td>CERD</td><td>Poland</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl18_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fPOL%2f15432&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">E</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_POL_15432_E.doc</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/POL/15432</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__8">
            <td>Annexe X</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl20_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15561&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15561_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15561</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
        </tr><tr class="rgRow InnerAlernatingItemStyle" id="ctl00_PlaceHolderMain_radResultsGrid_ctl00__9">
            <td>Annexe XI</td><td>Annex to State party report</td><td>CERD</td><td>Belgium</td><td>&nbsp;</td><td>&nbsp;</td><td>
                                            <a id="ctl00_PlaceHolderMain_radResultsGrid_ctl00_ctl22_MoreDocs" title="View document" href="http://tbinternet.ohchr.org/_layouts/treatybodyexternal/Download.aspx?symbolno=INT%2fCERD%2fADR%2fBEL%2f15562&amp;Lang=en" target="_blank" style="text-decoration:underline;">View document</a>&nbsp; 
                                        </td><td style="display:none;">&nbsp;</td><td style="display:none;">F</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT_CERD_ADR_BEL_15562_F.pdf</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">&nbsp;</td><td style="display:none;">INT/CERD/ADR/BEL/15562</td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
</tr>
</tbody>

I am trying the following line instead, to see how it changes things:

while ($content =~ m/rgRow.+?<td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td><td>(.+?)<\/td>/gs)

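The original code is not mine.

2 answers:

Answer 0: (score: 0)

I take this question as a problem of debugging old code. (Even so, see the end for an example that uses a parser.)

The reported problem is that the regex hangs. For me, it quits after a few matches on the first line. My first suspect is a stray newline; the /s modifier only makes . match newlines as well. Another suspect is the explicitly matched rgRow phrase - it is also part of the class attribute of the <tr> tag, and so can be matched under .* - a conflict? Finally, the regex explicitly looks for each cell while also using the /g modifier. For reference, here is the regex, which the code uses with the /sg modifiers.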

my $patt = qr/rgRow.*?
    <td>   (.*?)<\/td>
    <td.*?>(.*?)<\/td>
    <td.*?>(.*?)<\/td>
    <td.*?> .*? <\/td>
    <td.*?>(.*?)<\/td>
    <td.*?> <nobr>(.*?)<\/nobr> <\/td>
/x;
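
As a side note on the two modifiers mentioned here, the following minimal sketch shows what /s and /x each change. The sample string and variable names are made up for illustration and are not from the original page.

```perl
use strict;
use warnings;

# Made-up cell whose content spans two lines
my $text = "<td>first\nsecond</td>";

# Without /s, '.' does not match a newline, so this cell is not matched
my $without_s = $text =~ /<td>(.*?)<\/td>/  ? 1 : 0;
# With /s, '.' matches a newline as well
my $with_s    = $text =~ /<td>(.*?)<\/td>/s ? 1 : 0;

# With /x, literal whitespace in the pattern is ignored, so the
# spaced-out form matches the same text as the compact form
my $compact = qr/<td.*?>(.*?)<\/td>/s;
my $spaced  = qr/<td.*?> (.*?) <\/td>/sx;

my ($c) = $text =~ $compact;
my ($x) = $text =~ $spaced;

print "without /s: $without_s, with /s: $with_s\n";
print "same capture under /x: ", ($c eq $x ? "yes" : "no"), "\n";
```

So /s decides whether a cell containing a newline can match at all, while /x only affects how the pattern is laid out.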

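Picking the source apart character by character is not pleasant, and it does not work in general. We can do the following instead: remove the newlines, then capture the contents of the <td> tags into an array. The intent expressed in the regex is exactly that. (I change the regex delimiters to avoid editor coloring.)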

use warnings;
use strict;

# Placeholder: in real use $msg holds the HTML shown in the question
my $msg = 'pulled_from_url';
(my $msg_nonl = $msg) =~ s%\n%%g;

# The m is required when the delimiter is not a slash
my @raw_cells = $msg_nonl =~ m|<td.*?>(.*?)</td>|g;

# While we are at it: strip <nobr> and &nbsp;, drop empty elements
# (work on a copy so that @raw_cells is left unmodified)
my @cells = grep { !/^\s*$/ } map { (my $c = $_) =~ s%</?nobr>|&nbsp;%%g; $c } @raw_cells;
# Take the links ("View document") out as well
my @content = grep { !/<a.*?\/a>/ } @cells;
print "Total of " . scalar(@raw_cells) . " cells. ";
print "Cleaned up, down to " . scalar(@content) . " cells.\n";
print "$_\n" for @content;

This prints the cells' contents, edited here for space:

Total of 280 cells. Cleaned up, down to 82 cells.
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1974 / MOROCCO
State party's report
...
21 Feb 1974
...
True
CONSIDERATION OF REPORTS SUBMITTED BY ... DUE IN 1972 / MOROCCO
State party's report
...
17 Jan 1972
...
True

By inspecting the HTML we can see that the content is pulled out correctly.
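
To try the procedure without fetching the page, here is a minimal self-contained sketch of the same steps on a made-up one-row sample. The sample HTML is an assumption shaped like the question's rows, not the real page.

```perl
use strict;
use warnings;

# Made-up sample row: two data cells, a link cell, an empty cell, a flag
my $msg = <<'HTML';
<tr class="rgRow">
    <td>Annex I</td><td><nobr>21 Feb 1974</nobr></td><td>
        <a href="#">View document</a>&nbsp;
    </td><td style="display:none;">&nbsp;</td><td style="display:none;">True</td>
</tr>
HTML

# Remove newlines so that '.' can span a whole cell without /s
(my $flat = $msg) =~ s/\n//g;

# In list context, /g returns every captured cell body
my @raw_cells = $flat =~ m|<td.*?>(.*?)</td>|g;

my @content =
    grep { !/<a.*?\/a>/ }                                   # drop link cells
    grep { !/^\s*$/ }                                       # drop empty cells
    map  { (my $c = $_) =~ s%</?nobr>|&nbsp;%%g; $c }       # strip <nobr>, &nbsp;
    @raw_cells;

print scalar(@raw_cells), " raw cells, ", scalar(@content), " kept\n";
print "$_\n" for @content;
```

On this sample the five raw cells reduce to three: the title, the date, and the trailing flag.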

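I am not in a position to judge the poster's motives or constraints. Still, I cannot help but compare the guesswork and careful reading of the source above with the following.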

use HTML::TableExtract;   
my $te = HTML::TableExtract->new( keep_html => 1 );
$te->parse( "<table> " . $msg . "</table>" );
# We have one table, use top-level 'rows()' shorthand method
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}

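This reports the same 280 cells (when a count is added) and prints the same rows as the procedure above. I only had to look at the source to see that it is missing the <table> tags. HTML::TableExtract is a subclass of HTML::Parser.

Answer 1: (score: 0)

Your regex requires the sixth column to contain <nobr>...</nobr> tags, which occur only in the first two rows. It hangs after that because non-greedy quantifiers can only do so much. When no match is possible, they are just as susceptible to catastrophic backtracking as the greedy variety.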

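Instead of relying on .*? all the time, be specific about what you want to match. In this case that is easy: the TDs you are matching never contain other tags, so you can use [^<>]* to capture their contents. In fact, you should use it everywhere you currently use .*?.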
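
A minimal sketch of the difference, on a made-up three-cell row (the sample string and variable names are for illustration only): the lazy dot is still free to run across cell boundaries, while the negated class cannot leave the cell it started in.

```perl
use strict;
use warnings;

# Made-up row: only the third cell contains <nobr>
my $row = '<td>a</td><td>b</td><td><nobr>c</nobr></td>';

# Lazy dot: expands across </td><td> until <nobr> finally follows
my ($lazy)    = $row =~ /<td>(.*?)<nobr>/s;
# Negated class: stops at the first '<', so only the third <td> can match
my ($bounded) = $row =~ /<td>([^<>]*)<nobr>/;

print "lazy:    '$lazy'\n";      # 'a</td><td>b</td><td>'
print "bounded: '$bounded'\n";   # ''
```

With .*? the engine has many ways to stretch each piece before it gives up, and that is where the backtracking comes from. With [^<>]* a failed attempt is abandoned at the first tag boundary.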

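In the regex below I have also made the NOBR tags optional, and I have extended it to match the whole opening TR tag, more for readability than anything else.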

while ($content =~ 
  m!<tr\s+class="rgRow[^<>]*>\s*
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>[^<>]*</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td>
  !sxg) {
    # do stuff
}
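
For a quick check, the loop above can be run against a small sample. The sketch below uses two made-up rows shaped like the question's HTML (one with <nobr> in the sixth cell, one without). The @rows array is an assumption standing in for "do stuff".

```perl
use strict;
use warnings;

# Two made-up rows shaped like the question's HTML
my $content = join '',
    '<tr class="rgRow InnerItemStyle" id="r0">', "\n    ",
    '<td>Report A</td><td>State party\'s report</td><td>CERD</td>',
    '<td>Morocco</td><td>CERD/C/R.65/Add.1</td><td><nobr>21 Feb 1974</nobr></td>',
    "\n</tr>",
    '<tr class="rgRow InnerAlernatingItemStyle" id="r1">', "\n    ",
    '<td>Annex I</td><td>Annex to State party report</td><td>CERD</td>',
    '<td>Algeria</td><td>&nbsp;</td><td>&nbsp;</td>',
    "\n</tr>";

my @rows;   # one array ref of five captures per matched row
while ($content =~
  m!<tr\s+class="rgRow[^<>]*>\s*
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>[^<>]*</td>
    <td[^<>]*>([^<>]*)</td>
    <td[^<>]*>(?:<nobr>)?([^<>]*)(?:</nobr>)?</td>
  !sxg) {
    push @rows, [$1, $2, $3, $4, $5];
}

print scalar(@rows), " rows\n";
print join(' | ', @$_), "\n" for @rows;
```

Both rows match here, including the one without <nobr>, whose last two captures are the literal &nbsp; entity.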