Perl script to extract a specific div section from HTML code

Asked: 2016-02-23 12:54:11

Tags: html regex perl whitespace line-breaks

I have a very large HTML file. I need to extract a specific <div>...</div> section into a variable.

##some contents
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>
##some contents

I used a regular expression, like this:

my $html_filepath = "G:\\Report.html";
open(HTML, "<$html_filepath") or die "Can't open $html_filepath $!\n";
$body .= "\nTest Report Summary:\n\n";
my $content;
my $summarySection;
{
    local $/ = undef; # slurp mode
    $content = <HTML>;
}
$content =~ s/\r\n//g;
#print $content;

if ($content ne "")
{
    if ($content =~ m/<div class="title-bar" (.*)/)
    #if ( $last_line =~ m/^<tr> <td>(\d+)<\/td>/ )
    {
        $summarySection = "$1";
    }
}
print "\n $summarySection";

The output I get is:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016

But I need output like this:

<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036   (Number Of Cores: 4              

; CPU Clock Speed: 3500           

  Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>

I tried the following regular expression:

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/)

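But that does not work.

Please give me some ideas on how to capture the content, including newlines, line breaks, and spaces.

1 answer:

Answer 0 (score: 4)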

Don't use regexp to parse HTML. Use a Perl module to parse the HTML instead.

HTML::TreeBuilder

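use strict;
use warnings;
use HTML::TreeBuilder 5 -weak; # Ensure weak references

my $html_filepath = "G:\\Report.html"; # path taken from the question
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($html_filepath) or die "Can't parse $html_filepath: $!\n";
# Find the first <div> whose class attribute is "title-bar"
my $elem = $tree->look_down('_tag' => 'div', 'class' => 'title-bar')
    or die "No <div class=\"title-bar\"> found in $html_filepath\n";
warn $elem->as_HTML;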

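The problem with your regular expression is that . does not match a newline. Read this to learn how to match all characters: Regex to match any character including new lines

The way to fix it is to use the s modifier (treat the string as a single line):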

if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/s)
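
If you want to see the difference for yourself, here is a minimal sketch. The sample string is a made-up, shortened stand-in for your report, not the real file. I also switched to a non-greedy .*? here, which is my own assumption about what you want: with a greedy .* and the s modifier, the match would run to the last </table></div></div> in the file instead of the first one.

```perl
use strict;
use warnings;

# Hypothetical sample: a shortened stand-in for the report,
# with line breaks inside the div, as in the question.
my $content = join "\n",
    '<p>before</p>',
    '<div class="title-bar" onclick="folder(c_1)"><table><tr><td>Machine: XYZTST036   (Cores: 4',
    '',
    '; CPU Clock Speed: 3500',
    '</td></tr></table></div></div>',
    '<p>after</p></table></div></div>';

# Without /s, "." stops at the first newline, so the closing tags are never reached.
my $without_s = $content =~ m{<div class="title-bar" (.*)</table></div></div>} ? 1 : 0;

# With /s, "." also matches "\n"; the non-greedy ".*?" stops at the first closing sequence.
my ($section) = $content =~ m{(<div class="title-bar" .*?</table></div></div>)}s;

print "matched without /s: $without_s\n";
print "$section\n";
```

The capture group here wraps the whole match, so $section also contains the opening <div class="title-bar" part. This still depends on the exact closing-tag sequence in your file, which is one more reason to prefer the parser approach above.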