I'm trying to write a minimal web crawler. The goal is to discover new URLs from a seed and then crawl those new URLs further. The code is as follows:
use strict;
use warnings;
use Carp;
use Data::Dumper;
use WWW::Mechanize;
my $url = "http://foobar.com"; # example
my %links;
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
my @cr_fronteir = $mech->find_all_links();
foreach my $links (@cr_fronteir) {
    # WWW::Mechanize::Link objects: [0] is the URL, [1] is the link text
    if ( $links->[0] =~ m/^http/xms ) {
        $links{$links->[0]} = $links->[1];
    }
}
I'm stuck here: how do I go on to crawl the links collected in %links, and how do I add a depth limit to prevent overflow? Suggestions are appreciated.
Answer 0 (score: 5)
The Mojolicious web framework offers some interesting features that are useful for web crawlers, such as concurrent requests without the overhead of fork(). Here is an example that recursively crawls the local Apache documentation and prints each page's title and the extracted links. It uses 4 parallel connections, goes no deeper than 3 path levels, and visits each extracted link only once:
#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);
use Mojo::UserAgent;
# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));
# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);
# Track accessed URLs
my %uniq;
my $active = 0;
sub parse {
    my ($tx) = @_;

    # Request URL
    my $url = $tx->req->url;
    say "\n$url";
    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless $link->protocol =~ /^https?$/x;

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        next if ++$uniq{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }
    return;
}

sub get_callback {
    my (undef, $tx) = @_;

    # Parse only OK HTML responses
    $tx->res->code == 200
        and $tx->res->headers->content_type =~ m{^text/html\b}ix
        and parse($tx);

    # Deactivate
    --$active;
    return;
}

Mojo::IOLoop->recurring(
    0 => sub {
        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);
# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;
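A note on the design: the recurring timer with a 0-second interval fires on every tick of the event loop and tops the number of in-flight requests back up to 4 each time. When the queue runs dry, the callback either just returns (if crawlers are still active, their callbacks may enqueue more URLs) or stops the loop, so the script terminates cleanly once everything has been visited.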
For more web scraping tips & tricks, read the article I Don’t Need No Stinking API: Web Scraping For Fun and Profit.
Answer 1 (score: 4)
You can't recurse without making it a function.
use strict;
use warnings;
use Carp; #unused, but I guess yours was a sample
use Data::Dumper;
use WWW::Mechanize;
my %links;
my $mech = WWW::Mechanize->new(autocheck => 1);
sub crawl {
    my $url   = shift;
    my $depth = shift // 0;

    # this seems like a good place to assign some form of callback, so you can
    # generalize this function

    return if $depth > 10; # change as needed

    $mech->get($url);
    my @cr_fronteir = $mech->find_all_links();

    # not so sure what you're trying to do; before, the loop variable $links
    # was easily confused with the global %links
    # perhaps you meant this...?
    foreach my $link (@cr_fronteir) {
        if ($link->[0] =~ m/^http/xms) {
            $links{$link->[0]} = $link->[1];

            # be nice to servers - try not to overload them
            sleep 3;

            # recursion!
            crawl( $link->[0], $depth + 1 );
        }
    }
}

crawl("http://foobar.com", 0);
I don't have Perl installed on this partition, so this is prone to syntax errors and other mischief, but it can serve as a basis.
As the first comment in the function says: instead of hard-coding the mapping behaviour, you can generalize your function for greater glory by passing it a callback and calling it for every link you crawl.
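A minimal sketch of that callback idea, reusing the $mech and %links globals from above (the on_link parameter name and its signature are illustrative choices, not part of the original answer):

sub crawl {
    my ($url, $depth, $on_link) = @_;
    $depth //= 0;
    return if $depth > 10;

    $mech->get($url);
    foreach my $link ($mech->find_all_links()) {
        next unless $link->url =~ m/^http/;

        # hand every discovered link to the caller-supplied callback
        $on_link->($link->url, $link->text, $depth);

        sleep 3; # stay polite to servers

        crawl($link->url, $depth + 1, $on_link);
    }
}

# usage: collect links into %links, as the original code does
crawl("http://foobar.com", 0, sub {
    my ($url, $text, $depth) = @_;
    $links{$url} = $text;
});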
Answer 2 (score: 0)
Some pseudocode:
while ( scalar @links ) {
    my $link = shift @links;
    process_link($link);
}

sub process_link {
    my $link = shift;
    $mech->get($link);
    foreach my $page_link ( $mech->find_all_links() ) {
        my $url = $page_link->url;   # key on the URL string, not the link object
        next if $links{$url};        # already seen
        $links{$url} = 1;
        push @links, $url;
    }
}
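The question also asked about limiting depth; one way to bolt that onto this queue-based approach is to enqueue [url, depth] pairs instead of bare URLs (MAX_DEPTH and the pair layout here are illustrative choices, not part of the original answer):

use constant MAX_DEPTH => 3;

my @queue = ( [ 'http://foobar.com', 0 ] );   # [url, depth] pairs
my %seen;

while (@queue) {
    my ($url, $depth) = @{ shift @queue };
    next if $depth > MAX_DEPTH;

    $mech->get($url);
    foreach my $page_link ( $mech->find_all_links() ) {
        my $next = $page_link->url_abs->as_string;
        next if $seen{$next}++;               # visit each URL only once
        push @queue, [ $next, $depth + 1 ];   # children are one level deeper
    }
}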
P.S. You don't need the /m and /s modifiers (nor /x) in your code: the pattern ^http has no . for /s to affect, no embedded whitespace or comments for /x to permit, and you are matching single-line URL strings, so /m changes nothing.