For the past two days I've been hacking away at a little pet project, part of which involves writing a crawler in Perl.
I have no real Perl experience (only what I've picked up over those two days). My scripts are below:
ACTC.pm:
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use File::Basename;
use DBI;
use HTML::Parser;
use LWP::UserAgent;

# File-scoped user agent shared by all of the Crawler methods below.
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
$ua->max_redirect(0);    # redirects are followed by hand in scrape()

package Crawler;
sub new {
    my $class = shift;
    my $self = {
        _url      => shift,
        _max_link => 0,
        _local    => 1,
    };
    bless $self, $class;
    return $self;
}
sub trim {
    my ( $self, $string ) = @_;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

sub process_image {
    my ( $self, $process_image ) = @_;
    $self->{_process_image} = $process_image;
}

sub local {
    my ( $self, $local ) = @_;
    $self->{_local} = $local;
}

sub max_link {
    my ( $self, $max_link ) = @_;
    $self->{_max_link} = $max_link;
}

sub x_more {
    my ( $self, $x_more ) = @_;
    $self->{_x_more} = $x_more;
}

sub resolve_href {
    my ( $self, $base, $href ) = @_;
    my $u = URI->new_abs( $href, $base );
    return $u->canonical;
}
sub write {
    my ( $self, $ref, @data ) = @_;
    # Overwrite the checkpoint file for this list (hrefs, errors, etc.).
    open my $fh, '>', 'c:/perlscripts/' . $ref . '_' . $self->{_process_image} . '.txt'
        or die "Cannot write $ref file: $!";
    foreach my $line ( @data ) {
        print $fh $self->trim($line) . "\n";
    }
    close $fh;
}
sub scrape {
    my ( $self, $DBhost, $DBuser, $DBpass, $DBname ) = @_;
    my ( @m_error_array, @m_href_array, @href_array, $dbh, $result );

    # Resume from the checkpoint files if they exist, otherwise start from the seed URL.
    if ( defined( $self->{_process_image} ) && -e 'c:/perlscripts/href_w_' . $self->{_process_image} . '.txt' ) {
        open my $error_w,  '<', 'c:/perlscripts/error_w_'  . $self->{_process_image} . '.txt' or die "Cannot read error file: $!";
        open my $m_href_w, '<', 'c:/perlscripts/m_href_w_' . $self->{_process_image} . '.txt' or die "Cannot read m_href file: $!";
        open my $href_w,   '<', 'c:/perlscripts/href_w_'   . $self->{_process_image} . '.txt' or die "Cannot read href file: $!";
        @m_error_array = <$error_w>;
        @m_href_array  = <$m_href_w>;
        @href_array    = <$href_w>;
        close $error_w;
        close $m_href_w;
        close $href_w;
    } else {
        @href_array = ( $self->{_url} );
    }

    my $z = 0;
    while ( @href_array ) {
        # Stop after _x_more pages, if a limit was set.
        if ( defined( $self->{_x_more} ) && $z == $self->{_x_more} ) {
            print "died";
            last;
        }
        my $href = shift @href_array;
        $z++;

        # Checkpoint the work queues so an interrupted run can be resumed.
        if ( defined( $self->{_process_image} ) && @href_array ) {
            $self->write( 'm_href_w', @m_href_array );
            $self->write( 'href_w',   @href_array );
            $self->write( 'error_w',  @m_error_array );
        }
        $self->{_link_count} = scalar @m_href_array;

        my $info = URI->new($href);
        if ( ! $info->can('host') || ! defined $info->host ) {
            push @m_error_array, $href;
        } else {
            my $host = $info->host;
            $host =~ s/^www\.//;
            $self->{_current_page} = $href;

            # Follow up to $redirect_limit redirects by hand ($ua has max_redirect(0)).
            my $redirect_limit = 10;
            my $y = 0;
            my ( $response, $responseCode );
            while ( $y <= $redirect_limit ) {
                $response     = $ua->get($href);
                $responseCode = $response->code;
                if ( $responseCode == 301 || $responseCode == 302 ) {
                    $href = $self->resolve_href( $href, $response->header('Location') );
                } else {
                    last;
                }
                $y++;
            }

            if ( $y <= $redirect_limit && $responseCode == 200 ) {
                print $href . "\n";

                # Remember every successfully crawled URL on the object.
                push @{ $self->{_url_list} ||= [] }, $href;

                my $dsn = "dbi:mysql:$DBname:$DBhost:3306";
                $dbh    = DBI->connect( $dsn, $DBuser, $DBpass ) or die $DBI::errstr;
                $result = $dbh->prepare("INSERT INTO `$host` (URL) VALUES (?)");
                if ( ! $result->execute($href) ) {
                    # Table for this host does not exist yet: create it (the failed insert is not retried).
                    $result = $dbh->prepare("CREATE TABLE `$host` ( `ID` INT( 255 ) NOT NULL AUTO_INCREMENT , `URL` VARCHAR( 255 ) NOT NULL , PRIMARY KEY ( `ID` )) ENGINE = MYISAM ;");
                    $result->execute();
                    print "Host added: " . $host . "\n";
                }

                my $content = $response->content;
                die "get failed: " . $href if ( ! defined $content );

                # Pull every href="..." / href='...' out of the page.
                my @pageLinksArray = ( $content =~ m/href=["']([^"']*)["']/g );
                foreach my $raw_link ( @pageLinksArray ) {
                    my $link = $self->trim($raw_link);
                    if ( $self->{_max_link} != 0 && scalar @m_href_array > $self->{_max_link} ) {
                        last;
                    }
                    my $new_href = $self->resolve_href( $href, $link );
                    if ( $new_href =~ m/^http:\/\// && substr( $new_href, -1 ) ne '#' ) {
                        my $base = $self->{_url};
                        # Hash used as a "seen already" lookup over @m_href_array.
                        my %values_index;
                        @values_index{@m_href_array} = ();
                        if ( $new_href !~ m/\Q$base\E/ ) {
                            # Off-site link (note: as written this branch behaves the same as the on-site one).
                            if ( $self->{_local} eq 'true' && ! exists $values_index{$new_href} ) {
                                push @m_href_array, $new_href;
                                push @href_array,   $new_href;
                            }
                        } elsif ( $self->{_local} eq 'true' && ! exists $values_index{$new_href} ) {
                            push @m_href_array, $new_href;
                            push @href_array,   $new_href;
                        }
                    }
                }
            } else {
                push @m_error_array, $href;
            }
        }
    }
}
1;
new_spider.pl:
#!/usr/bin/perl
use strict;
use warnings;
use ACTC;

my ( $object, $url );
print "Starting Point (url): ";
chomp( $url = <> );

$object = Crawler->new($url);
$object->process_image('process_image_name');
$object->local('true');
$object->max_link(0);
$object->x_more(9999999);
$object->scrape( 'localhost', 'root', '', 'crawl' );

#print $object->{_url} . "\n";
#print $object->{_process_image};
It isn't finished yet and some features don't work properly, but after running the script I indexed about 1,500 pages in roughly an hour, which seems slow to me.
The script churned through results quickly to start with, but it has now slowed to spitting out about one URL per second.
Can anyone offer any tips on how to improve performance?
Answer 0 (score: 3)
Most of the time your program is probably just waiting for responses from the network, and there isn't much you can do about that waiting time (short of putting your machine right next to the one it's talking to). Split off a separate process to fetch each URL so you can download several of them at once. You might look at something like Parallel::ForkManager, POE, or AnyEvent.
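For illustration, here is a minimal sketch of the fork-per-URL idea using Parallel::ForkManager; the seed list, the limit of 10 concurrent children, and printing the status code are placeholders rather than anything from the original crawler:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Hypothetical seed list; in the crawler above this would come from @href_array.
my @urls = ( 'http://example.com/', 'http://example.org/' );

my $pm = Parallel::ForkManager->new(10);          # at most 10 downloads in flight
my $ua = LWP::UserAgent->new( timeout => 10 );

foreach my $url (@urls) {
    $pm->start and next;                          # parent: move on to the next URL
    my $response = $ua->get($url);                # child: blocking fetch
    printf "%s => %s\n", $url, $response->code;   # placeholder for real processing
    $pm->finish;                                  # child exits here
}
$pm->wait_all_children;

Because each child is a separate process, results have to come back through something shared, such as the checkpoint files or the MySQL table the crawler already uses, rather than through Perl variables in the parent.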
Answer 1 (score: 1)
See brian's answer.
Run lots of copies. Use a shared storage system to hold the intermediate and final data.
It may also help to take the more memory-intensive parts of the crawler (HTML parsing and so on) and move them into a separate pool of processes.
So you would have a pool of readers that pull URLs from a queue of pages to fetch and drop the downloaded pages into a shared storage area, and a pool of parser processes that read those pages, write their results to a results database, and enqueue any newly found pages onto the fetch queue.
Or something along those lines. It really depends on what your crawler is for.
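As a rough illustration of that reader/parser split, here is a toy sketch. The answer describes pools of processes; purely to keep the example self-contained this version uses Perl ithreads and Thread::Queue instead, and the thread counts, seed URLs, and the final print are invented placeholders. A real crawler would push newly discovered links back onto the fetch queue and write results to shared storage rather than printing a count.

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use LWP::UserAgent;

my $fetch_queue = Thread::Queue->new();   # URLs waiting to be downloaded
my $parse_queue = Thread::Queue->new();   # fetched pages waiting to be parsed
$fetch_queue->enqueue( 'http://example.com/', 'http://example.org/' );

# Reader pool: pull URLs, download them, hand the HTML to the parsers.
my @readers = map {
    threads->create( sub {
        my $ua = LWP::UserAgent->new( timeout => 10 );
        while ( defined( my $url = $fetch_queue->dequeue ) ) {
            my $response = $ua->get($url);
            $parse_queue->enqueue( [ $url, $response->decoded_content ] )
                if $response->is_success;
        }
    } );
} 1 .. 5;

# Parser pool: extract links; a real version would store results and feed
# unseen links back onto $fetch_queue instead of just printing a count.
my @parsers = map {
    threads->create( sub {
        while ( defined( my $job = $parse_queue->dequeue ) ) {
            my ( $url, $html ) = @$job;
            my @links = ( $html =~ m/href=["']([^"']+)["']/g );
            print "$url: ", scalar @links, " links\n";
        }
    } );
} 1 .. 2;

$fetch_queue->end;                 # no more seeds in this toy example
$_->join for @readers;
$parse_queue->end;
$_->join for @parsers;

Swapping the threads for separate processes that communicate through a shared database or message queue gives the multi-machine layout described above.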
Ultimately, if you're trying to crawl a lot of pages you'll probably need a lot of hardware and a very fat pipe (into your data centre/colo), so you want an architecture that lets the pieces of the crawler be split across multiple machines in order to scale properly.