I am not a Perl expert, but I wrote a Perl script that parses an HTML page and filters for all the href attributes.
The output looks like this:
href="?Name">Name</a>
href="?Desc">Hourly Details</a>
href="/24x7/2012/11-November/">Data
href="./00:00:00/">00:00:00/</a>
href="./01:00:00/">01:00:00/</a>
href="./02:00:00/">02:00:00/</a>
href="./03:00:00/">03:00:00/</a>
href="./04:00:00/">04:00:00/</a>
href="./05:00:00/">05:00:00/</a>
href="./06:00:00/">06:00:00/</a>
href="./07:00:00/">07:00:00/</a>
href="./08:00:00/">08:00:00/</a>
href="./09:00:00/">09:00:00/</a>
href="./10:00:00/">10:00:00/</a>
href="./11:00:00/">11:00:00/</a>
href="./12:00:00/">12:00:00/</a>
href="./13:00:00/">13:00:00/</a>
href="./14:00:00/">14:00:00/</a>
href="./15:00:00/">15:00:00/</a>
href="./16:00:00/">16:00:00/</a>
href="./17:00:00/">17:00:00/</a>
href="./18:00:00/">18:00:00/</a>
href="./19:00:00/">19:00:00/</a>
href="./20:00:00/">20:00:00/</a>
href="./21:00:00/">21:00:00/</a>
href="./22:00:00/">22:00:00/</a>
href="./23:00:00/">23:00:00/</a>
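Now I want to extract only the values from "00:00:00" to "23:00:00" inside the href attributes and exclude everything else. Each extracted value will then be appended to a URL string: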
http://x.download.com/00:00:00
------URL------------/..href../
..............................
http://x.download.com/23:00:00
However, when I try the following code:
foreach (@tag) {
    if (m/href/) {
        if ($_ =~ /"\/24/ && $_ =~ /"\/[0-9]/) {
            my $href = $_;
            my $start = index($href, "\"");
            my $end = rindex($href, "\"");
            my $link = substr($href, $start + 1, $end - $start - 1);
            print "Follow: " . $url . $link . "\n";
        }
    }
}
it prints:
Follow: http://x.download.com/24x7/2012/11-November/
How should I write my regular expression to get the desired result?
Answer 0 (score: 3)
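This is very simple with a regular expression, as the program below shows. It looks for a string of digits or colons immediately after a > (so it matches the element's text content rather than the href attribute value) and captures that string into $1.
However, I would much rather see the problem solved from start to finish with a proper HTML parser, such as HTML::TreeBuilder or Mojo::DOM.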
use strict;
use warnings;
my @tag = <DATA>;
foreach (@tag) {
    next unless />([\d:]+)/;
    print "http://x.download.com/$1\n";
}
__DATA__
href="?Name">Name</a>
href="?Desc">Hourly Details</a>
href="/24x7/2012/11-November/">Data
href="./00:00:00/">00:00:00/</a>
href="./01:00:00/">01:00:00/</a>
href="./02:00:00/">02:00:00/</a>
href="./03:00:00/">03:00:00/</a>
href="./04:00:00/">04:00:00/</a>
href="./05:00:00/">05:00:00/</a>
href="./06:00:00/">06:00:00/</a>
href="./07:00:00/">07:00:00/</a>
href="./08:00:00/">08:00:00/</a>
href="./09:00:00/">09:00:00/</a>
href="./10:00:00/">10:00:00/</a>
Output:
http://x.download.com/00:00:00
http://x.download.com/01:00:00
http://x.download.com/02:00:00
http://x.download.com/03:00:00
http://x.download.com/04:00:00
http://x.download.com/05:00:00
http://x.download.com/06:00:00
http://x.download.com/07:00:00
http://x.download.com/08:00:00
http://x.download.com/09:00:00
http://x.download.com/10:00:00
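The parser-based route recommended in this answer could look roughly like the sketch below, which uses Mojo::DOM. It assumes the Mojolicious distribution is installed and that the page contains ordinary <a> elements like those implied by the question's output. The sample markup, the helper name hourly_urls, and the way the base URL is passed in are illustrative assumptions, not part of the original answer.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return one URL per hourly link found in $html.
sub hourly_urls {
    my ($html, $base) = @_;
    my $dom = Mojo::DOM->new($html);
    my @urls;
    for my $link ($dom->find('a[href]')->each) {
        # Keep only href values of the form ./HH:MM:SS/
        next unless $link->attr('href') =~ m{\A\./(\d\d:\d\d:\d\d)/\z};
        push @urls, "$base$1";
    }
    return @urls;
}

# Sample markup reconstructed from the question's output (an assumption).
my $html = <<'HTML';
<a href="?Name">Name</a>
<a href="/24x7/2012/11-November/">Data</a>
<a href="./00:00:00/">00:00:00/</a>
<a href="./23:00:00/">23:00:00/</a>
HTML

print "$_\n" for hourly_urls($html, 'http://x.download.com/');
```

Because the pattern is applied to the parsed href attribute rather than to raw text, the "?Name" and "/24x7/..." links are skipped without any extra conditions.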
Answer 1 (score: 3)
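You don't want to use a regular expression for this. You need a proper HTML parser; regexes are not up to the job.
How are you fetching the web page? If you are using WWW::Mechanize, then extracting the links from the page you fetched is a single method call, because WWW::Mechanize does the HTML parsing for you.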
use feature 'say';   # enables say()
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
$mech->get( $url );  # $url is the page address, set elsewhere in your script
my @links = $mech->links();
for my $link ( @links ) {
    say $link->text, ' -> ', $link->url; # Show the text and the URL
}
You will need to reformat the output to suit your needs, but this should give you the idea.
Answer 2 (score: 0)
First, we need a regular expression that captures a 24-hour ("military") time down to the second.
my $regex
    = qr{                # curly brackets instead of slashes
                         # so that we can use literal slashes in expression
        "                # a quote
        \.               # a literal dot
        /                # a forward slash
        (                # begin capture group
            (?:          # begin uncaptured sub-group
                [01] \d  # a '0' or '1' followed by a digit
              | 2 [0-3]  # a '2' followed by 0-3
            )            # end grouping
            (?:          # begin repetition grouping
                :        # a literal colon
                [0-5] \d # digits 0-5 followed by any digit
            ){2}         # exactly twice
        )                # end capture
        /                # a forward slash
        "                # close quote
    }x;                  # <- x-option allows annotated regex
...
This is equivalent to the following compact regex:
my $regex = qr/"\.\/((?:[01]\d|2[0-3])(?::[0-5]\d){2})\/"/;
If your minutes and seconds can only ever be '00:00', the expression becomes even simpler:
my $regex = qr{"\./((?:[01]\d|2[0-3]):00:00)/"};
You can then test for a match and retrieve the value in one step by matching in list context:
if ( my ( $link ) = m/$regex/ ) {
    say "http://x.download.com/$link";
}
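If the match fails, it returns an empty list, so $link is undefined and the condition is false. If it succeeds, then because $link is declared inside a list (of one element), the match runs in list context and assigns the first capture to the variable.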
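Putting the pieces together, a minimal self-contained sketch might look like this. It applies the compact form of the regex from this answer to a few of the lines from the question, using only core Perl. The helper name extract_time, the inline sample list, and the extra "24:00:00" line (added to show that an invalid hour is rejected) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Compact form of the pattern: a quoted "./HH:MM:SS/", capturing HH:MM:SS.
my $regex = qr{"\./((?:[01]\d|2[0-3])(?::[0-5]\d){2})/"};

# Hypothetical helper: return the captured time, or undef when $line does not match.
sub extract_time {
    my ($line) = @_;
    my ($time) = $line =~ $regex;   # list context: first capture, or empty list
    return $time;
}

# A few lines from the question, plus one invalid hour (an assumption).
my @tag = (
    'href="?Name">Name</a>',
    'href="/24x7/2012/11-November/">Data',
    'href="./00:00:00/">00:00:00/</a>',
    'href="./23:00:00/">23:00:00/</a>',
    'href="./24:00:00/">24:00:00/</a>',
);

for my $line (@tag) {
    my $time = extract_time($line);
    next unless defined $time;      # skip lines without a valid time
    print "http://x.download.com/$time\n";
}
```

Only the 00:00:00 and 23:00:00 lines produce a URL. Anchoring the pattern on the surrounding quotes and slashes ties the match to the href value rather than to the link text.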