Perl Mechanize find_all_links array loop problem

Asked: 2012-10-31 20:03:11

Tags: perl web-crawler www-mechanize

I am currently trying to create a Perl web spider using WWW::Mechanize.

What I want to do is create a web spider that will crawl an entire site (entered by the user) and extract all of the links from every page on that site.

The problem I have is how to spider the whole site and collect every link without duplicates. Here is what I have so far (the part I am having trouble with, anyway):

foreach (@nonduplicates) {   # array containing URLs like www.tree.com/contact-us, www.tree.com/varieties...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CAN'T GET IT WORKING:
    # foreach (@list) {
    #     if $_ is already in @nonduplicates
    #         then do nothing, because that link has already been found
    #     } else {
    #         append the link to the end of @nonduplicates, so that if it has not been crawled for links already, it will be
    #     }
    # }
}

How can I get the above working?

I am doing this so I can spider the whole site and get a complete list of every URL on the site, with no duplicates.

If you think this is not the best or simplest way of achieving the same result, I am open to ideas.

Your help is much appreciated, thanks.

2 Answers:

Answer 0 (score: 1)

Create a hash to track which links you have seen before, and push any unseen ones onto @nonduplicates for processing:

$| = 1;   # Enable autoflush so the progress line below updates immediately.
my $scanned = 0;

my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }
    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}
use Data::Dumper;
print Dumper(\%link_tracker);
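
The snippet above assumes that $mech and $urlToSpider already exist, as they do in the question. A minimal setup sketch might look like the following (the starting URL is just a placeholder, and autocheck => 0 is an assumption so that a failed GET does not abort the crawl):

use strict;
use warnings;
use WWW::Mechanize;

# Placeholder starting URL; in the question this comes from user input.
my $urlToSpider = 'http://www.tree.com';

# autocheck => 0 keeps a failed GET (404, timeout, etc.) from dying mid-crawl.
my $mech = WWW::Mechanize->new( autocheck => 0 );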

Answer 1 (score: 0)

use List::MoreUtils qw/uniq/;
...

my @list = $mech->find_all_links(...);

my @unique_urls = uniq( map { $_->url } @list );

Now @unique_urls contains the unique URLs from @list.
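
Note that uniq() only removes duplicates within the links of a single page; to avoid re-crawling URLs across pages you would still need something like the %link_tracker hash from the first answer. A rough sketch combining the two (the names %seen and @queue are placeholders, not from the original code):

my %seen;                      # URLs encountered anywhere in the crawl so far
my @queue = ($urlToSpider);    # pages still to fetch, analogous to @nonduplicates above

# ... inside the crawl loop, after find_all_links():
my @unique_urls = uniq( map { $_->url_abs() } @list );   # uniq from List::MoreUtils, loaded above
push @queue, grep { !$seen{$_}++ } @unique_urls;         # only enqueue URLs never seen before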