I am currently trying to create a Perl web spider using WWW::Mechanize.
What I want to do is build a spider that crawls an entire site (entered by the user) and extracts all the links from every page on that site.
The problem is how to spider the whole site and collect every link without duplicates. Here is what I have so far (the part I'm having trouble with, anyway):
foreach (@nonduplicates) { # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties, ...
    $mech->get($_);

    # Find all links on this page that start with http://www.tree.com
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    # NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CAN'T GET IT WORKING
    #foreach (@list) {
    #    if $_ is already in @nonduplicates
    #        then do nothing, because that link has already been found
    #    else
    #        append the link to the end of @nonduplicates, so that if it has not been crawled for links already, it will be
    #}
}
How can I get the above working?
I'm doing this to spider the whole site and get a complete list of every URL on the site, with no duplicates.
If you think this isn't the best/simplest way to achieve the same result, I'm open to other ideas.
Any help is much appreciated, thanks.
Answer 0 (Score: 1)
Create a hash to keep track of the links you have already seen, and push any unseen links onto @nonduplicates for processing:
$| = 1;  # Autoflush STDOUT so the progress line updates in place.

my $scanned = 0;

my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }

    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}

use Data::Dumper;
print Dumper(\%link_tracker);
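For completeness, the snippet above assumes $mech and $urlToSpider already exist, as in the question. A minimal setup sketch (the variable names come from the question; the command-line handling and the autocheck option are just one way to do it) might look like this:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Read the start URL from the command line, e.g. http://www.tree.com
my $urlToSpider = shift @ARGV or die "Usage: $0 <start-url>\n";

# autocheck => 0 keeps the spider running when a page returns an error,
# instead of dying on the first failed GET.
my $mech = WWW::Mechanize->new( autocheck => 0 );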
Answer 1 (Score: 0)
use List::MoreUtils qw/uniq/;
...
my @list = $mech->find_all_links(...);
my @unique_urls = uniq( map { $_->url } @list );
Now @unique_urls contains the unique URLs from @list.
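Note that uniq only removes duplicates among the links found on a single page; to avoid re-crawling pages across the whole site you still need a seen-URLs hash like the one in Answer 0. A rough sketch of combining the two, assuming the same $mech and $urlToSpider as above (%seen and @queue are names introduced here for illustration):

use List::MoreUtils qw/uniq/;

my %seen  = ( $urlToSpider => 1 );
my @queue = ( $urlToSpider );

while ( my $url = shift @queue ) {
    $mech->get($url);

    # Stringify each URI and de-duplicate the links found on this page.
    my @links = uniq( map { $_->url_abs() . "" }
                      $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/) );

    # Only queue links we haven't seen anywhere on the site yet.
    push @queue, grep { !$seen{$_}++ } @links;
}

print "$_\n" for sort keys %seen;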