I created the following Perl script to extract URLs from a web page:
#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;
use List::MoreUtils qw( uniq );
use WWW::Mechanize qw( );

my ($url) = @ARGV;
my $mech = WWW::Mechanize->new();

sub getUrl {
    my ($request) = @_;
    my $response = $mech->get($request);
    # Check success before returning; "return ... or die" never dies,
    # because "return" executes before the low-precedence "or".
    $response->is_success() or die( $response->status_line() . "\n" );
    return $response;
}

sub getLinks {
    getUrl($url);
    my @root = map { "$_\n" }
               sort { $a cmp $b }
               uniq
               map { $_->url_abs() }
               $mech->links();
    return @root;
}

print Dumper( getLinks() );
Is there a way to extract the unique URLs from an HTML page together with their associated link text?
Answer 0 (score: 1)
Have a look at HTML::LinkExtor, which extracts links from an HTML document.
See the Example section in the module's documentation; it should get you started.
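A minimal sketch of the HTML::LinkExtor approach. The HTML snippet and base URL here are made up for illustration; a real script would fetch the page first. Note that HTML::LinkExtor's callback receives only the tag name and its link attributes, not the anchor text:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::LinkExtor;
use URI;

# Sample input (assumption for this sketch)
my $base = 'http://example.com/';
my $html = '<a href="/a">A</a> <a href="b.html">B</a> <img src="pic.png">';

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ( $tag, %attr ) = @_;
        return unless $tag eq 'a';    # keep only anchor tags, skip <img> etc.
        push @links, $attr{href};
    }
);
$parser->parse($html);
$parser->eof;

# Resolve relative URLs against the base
my @abs = map { URI->new($_)->abs($base)->as_string } @links;
print "$_\n" for @abs;
```

Because the callback never sees the text between `<a>` and `</a>`, this module alone answers only the "unique URLs" half of the question; for link text, WWW::Mechanize's `links()` (as in the other answer) is more convenient.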
Answer 1 (score: 1)
my %seen;
my @result;
foreach my $link ( $mech->links() ) {
    next if $seen{ $link->url_abs() }++;    # skip URLs we have already collected
    push @result, {
        url  => $link->url_abs(),
        text => $link->text(),
    };
}
# Now @result holds one hashref per unique link (URL plus link text),
# so you can sort the array however you like.
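To show what sorting that array of hashrefs looks like, here is a small self-contained sketch; the two hard-coded entries stand in for what `$mech->links()` would have produced and are an assumption of this example:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in data (assumption): hashrefs shaped like those pushed onto @result
my @result = (
    { url => 'http://example.com/b', text => 'Page B' },
    { url => 'http://example.com/a', text => 'Page A' },
);

# Sort case-insensitively by URL; swap in $a->{text} to sort by link text
my @sorted = sort { lc( $a->{url} ) cmp lc( $b->{url} ) } @result;

printf "%s => %s\n", $_->{url}, $_->{text} for @sorted;
```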