I have a very large HTML file. I need to extract a specific <div>...</div> section into a variable.
##some contents
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036 (Number Of Cores: 4
; CPU Clock Speed: 3500
Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>
##some contents
I used a regular expression, like this:
my $html_filepath = "G:\\Report.html";
open(HTML, "<$html_filepath") or die "Can't open $html_filepath $!\n";
$body .= "\nTest Report Summary:\n\n";
my $content;
my $summarySection;
{
    local $/ = undef; # slurp mode
    $content = <HTML>;
}
$content =~ s/\r\n//g;
#print $content;
if ($content ne "")
{
    if ($content =~ m/<div class="title-bar" (.*)/)
    #if ( $last_line =~ m/^<tr> <td>(\d+)<\/td>/ )
    {
        $summarySection = "$1";
    }
}
print "\n $summarySection";
The output I get is:
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036 (Number Of Cores: 4
; CPU Clock Speed: 3500
Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
But I need output like this:
<div class="title-bar" onclick="folder(c_1)"><table class="layout"><tr><td class="h1" width="400">Summary Of Test Report <br>(E:\Packages\SamplePackage)</td><td><a style="cursor:hand;text-decoration:none;" onclick="showTOC()"><div style="float:left"><div style="float:right"><div style="float:left"></div></div></a></td></tr></table></div><div expandable="1" id="c_1"><a name="title"></a><table class="content" cellpadding="2"><tr><td><table id="details"><tr><td class="h4">Package Name:</td><td class="info">E:\Packages\SamplePackage</td></tr><tr><td class="h4">OS:</td><td class="info">Microsoft Windows Server 2008 R2 Standard </td></tr><tr><td class="h4">Testing:</td><td class="info">Regression Test</td></tr><tr><td class="h4">Machine Name:</td><td class="info">XYZTST036 (Number Of Cores: 4
; CPU Clock Speed: 3500
Mhz; Memory: 32,494 MB)</td></tr><tr><td class="h4">Duration:</td><td class="info">00:28:31</td></tr><tr><td class="h4">Total No. Of Testcases:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Executed:</td><td class="info">54</td></tr><tr><td class="h4">No. Of Testcases Passed:</td><td class="info">42</td></tr><tr><td class="h4">No. Of Testcases Failed:</td><td class="info">0</td></tr><tr><td class="h4">No. Of Testcases NA(Not Appplicable):</td><td class="info">12</td></tr><tr><td class="h4">Skipped Testcases:</td><td class="info"><a href="SkippedTestcaseDetails.html">None</a></td></tr><tr><td class="h4">Date:</td><td class="info">8-02-2016
</td></tr><tr><td class="h4">Start Time(17:58:02)/ Completion Time (18:26:33)</td><td class="info"></td></tr></table></td></tr></table></div></div>
I tried the following regular expression:
if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/)
but it does not work.
Please give me some ideas for capturing the content, including line feeds, newlines, and spaces.
Answer 0 (score: 4)
Please don't use regexp to parse HTML. Use a Perl module to parse the HTML instead.
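use strict;
use warnings;
use HTML::TreeBuilder 5 -weak; # Ensure weak references

my $html_filepath = "G:\\Report.html"; # path taken from the question
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($html_filepath);
# look_down returns the first element matching every criterion in scalar context
my $elem = $tree->look_down('_tag' => 'div', 'class' => 'title-bar')
    or die "No <div class=\"title-bar\"> found in $html_filepath\n";
warn $elem->as_HTML; # print the matched element, including its children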
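The problem with your regular expression is that the dot (.) does not match newline characters. Read this to learn how to match every character: Regex to match any character including new lines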
The way to fix this is to use the s modifier (treat the string as a single line):
if ($content =~ m/<div class="title-bar" (.*)<\/table><\/div><\/div>/s)
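As a rough illustration of the difference the s modifier makes, here is a minimal sketch using only core Perl. The $content string is a hypothetical, heavily shortened stand-in for the real report, and the variable names are made up for the example. Two things differ from the pattern above and are assumptions on my part: the capture group wraps the whole block so the opening and closing tags are kept, and a non-greedy .*? is shown next to the greedy .* because a greedy match runs to the last closing sequence in the file if it occurs more than once.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the report; the newline inside the
# block mimics the line breaks in the real file.
my $content = join '',
    "<p>before</p>\n",
    '<div class="title-bar" onclick="folder(c_1)"><table><tr><td>Cores: 4', "\n",
    '; Memory: 32,494 MB</td></tr></table></div></div>', "\n",
    "<p>after</p>\n",
    '<div><table></table></div></div>', "\n";

# Without /s the dot stops at the newline, so the closing tags are never reached.
my $matched_without_s =
    $content =~ m{<div class="title-bar" .*</table></div></div>} ? 1 : 0;

# With /s the dot also matches "\n". The non-greedy .*? stops at the first
# closing sequence rather than the last one in the string.
my ($summary_section) =
    $content =~ m{(<div class="title-bar" .*?</table></div></div>)}s;

# Greedy variant for comparison: it runs to the last closing sequence.
my ($greedy_section) =
    $content =~ m{(<div class="title-bar" .*</table></div></div>)}s;

print "matched without /s: $matched_without_s\n";
print "non-greedy length: ", length($summary_section), "\n";
print "greedy length: ", length($greedy_section), "\n";
```

Whether the greedy or the non-greedy form is right depends on how often the closing sequence appears in the actual file, which is one more reason the parser-based approach tends to be more robust.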