I have a small scraping application and I am trying to add multithreading to it. Here is the code (MyMech is a WWW::Mechanize subclass used to handle HTTP errors):
#!/usr/bin/perl
use strict;
use MyMech;
use File::Basename;
use File::Path;
use HTML::Entities;
use threads;
use threads::shared;
use Thread::Queue;
use List::Util qw( max sum );
my $page = 1;
my %CONFIG = read_config();
my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);
$mech->get( $CONFIG{BASE_URL} . "/site-map.php" );
my @championship_links =
  $mech->find_all_links( url_regex => qr/\d{4}-\d{4}\/$/ );
foreach my $championship_link (@championship_links) {
    my @threads;
    my $queue = Thread::Queue->new;
    my $queue_processed = Thread::Queue->new;
    my $url = sprintf $championship_link->url_abs();
    print $url, "\n";
    next unless $url =~ m{soccer}i;
    $mech->get($url);
    my ( $last_round_loaded, $current_round ) =
      find_current_round( $mech->content() );
    unless ($last_round_loaded) {
        print "\tLoading rounds data...\n";
        $mech->submit_form(
            form_id => "leagueForm",
            fields => {
                round => $current_round,
            },
        );
    }
    my @match_links =
      $mech->find_all_links( url_regex => qr/matchdetails\.php\?matchid=\d+$/ );
    foreach my $link (@match_links) {
        $queue->enqueue($link);
    }
    print "Starting printing thread...\n";
    my $printing_thread = threads->create(
        sub { printing_thread( scalar(@match_links), $queue_processed ) } )
      ->detach;
    push @threads, $printing_thread;
    print "Starting threads...\n";
    foreach my $thread_id ( 1 .. $CONFIG{NUMBER_OF_THREADS} ) {
        my $thread = threads->create(
            sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
          ->join;
        push @threads, $thread;
    }
    undef $queue;
    undef $queue_processed;
    foreach my $thread ( threads->list() ) {
        if ( $thread->is_running() ) {
            print $thread->tid(), "\n";
        }
    }
    #sleep 5;
}
print "Finished!\n";
sub printing_thread {
    my ( $number_of_matches, $queue_processed ) = @_;
    my @fields =
      qw (
      championship
      year
      receiving_team
      visiting_team
      score
      average_home
      average_draw
      average_away
      max_home
      max_draw
      max_away
      date
      url
    );
    while ($number_of_matches) {
        if ( my $match = $queue_processed->dequeue_nb ) {
            open my $fh, ">>:encoding(UTF-8)", $CONFIG{RESULT_FILE} or die $!;
            print $fh join( "\t", @{$match}{@fields} ), "\n";
            close $fh;
            $number_of_matches--;
        }
    }
    threads->exit();
}
sub scrape_match {
    my ( $thread_id, $queue, $queue_processed ) = @_;
    while ( my $match_link = $queue->dequeue_nb ) {
        my $url = sprintf $match_link->url_abs();
        print "\t$url", "\n";
        my $mech = MyMech->new( autocheck => 1 );
        $mech->quiet(0);
        $mech->get($url);
        my $match = parse_match( $mech->content() );
        $match->{url} = $url;
        $queue_processed->enqueue($match);
    }
    return 1;
}
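I am seeing some strange behaviour with this code. Sometimes it runs, but sometimes it exits without any error (at the ->detach point). I know that @match_links contains data, but the threads are not created and the program just closes. It usually terminates after processing the second $championship_link entry.
What might I be doing wrong?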
Update
Here is the code for the find_current_round subroutine (but I am sure it is not related to the problem):
sub find_current_round {
    my ($html) = @_;
    my ($select_html) = $html =~ m{
        <select\s+name="round"[^>]+>\s*
        (.+?)
        </select>
    }isx;
    my ( $option_html, $current_round ) = $select_html =~ m{
        (<option\s+value="\d+"(?:\s+ selected="selected")?>(\d+)</option>)\Z
    }isx;
    my ($last_round_loaded) = $option_html =~ m{selected};
    return ( $last_round_loaded, $current_round );
}
Answer 0 (score: 0)
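First off, don't use dequeue_nb(). It is a bad idea, because if the queue is temporarily empty it returns undef and your thread exits.
Use dequeue and end instead. dequeue blocks while the queue is empty, but once you end the queue and it has been drained, dequeue returns undef and the while loop exits.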
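As a minimal sketch of that dequeue/end pattern (assuming a Thread::Queue recent enough to provide end(), which I believe means version 3.01 or later; the integer work items are placeholders, not your match links):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;

# The worker blocks in dequeue() while the queue is empty and only
# leaves the loop once the queue has been ended and fully drained.
my $worker = threads->create( sub {
    my $seen = 0;
    while ( defined( my $item = $queue->dequeue ) ) {
        $seen++;    # placeholder for the real per-item work
    }
    return $seen;
} );

$queue->enqueue($_) for 1 .. 5;    # placeholder work items
$queue->end;                       # no more items will be added

my $count = $worker->join;
print "worker processed $count items\n";
```

Testing with defined(...) rather than plain truthiness also keeps the loop from stopping early on a legitimately false item such as 0.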
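You are also doing some odd things with threads. I'd suggest you rarely want to detach a thread. You are just assuming that your thread will finish before your program does, which isn't a good plan.
Likewise this:
my $thread = threads->create(
    sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
  ->join;
You are spawning a thread and then immediately joining it, so the join call blocks until that thread exits. Each worker therefore runs to completion before the next one is created, and you don't need threads at all to do that.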
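A rough sketch of the alternative: create all the threads first, keep the thread objects, and join them only afterwards. The worker count and the body of the worker are made up for illustration:

```perl
use strict;
use warnings;
use threads;

my $worker_count = 3;    # hypothetical; stands in for $CONFIG{NUMBER_OF_THREADS}

# Start every worker before waiting on any of them, so they can overlap.
my @workers;
for ( 1 .. $worker_count ) {
    push @workers, threads->create( sub {
        return threads->tid() * 10;    # placeholder for real work
    } );
}

# Only now block, collecting each worker's return value.
my @results = map { $_->join } @workers;
print "collected ", scalar(@results), " results\n";
```

Note that what gets stored in @workers is the thread object returned by create; in your code the variable receives the result of ->join (or ->detach) instead.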
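You also scope your queues inside the foreach loop. I don't think that is a good plan. I'd suggest instead that you scope them outside the loop and spawn a fixed number of 'worker' threads (and one 'printing' thread).
Then just feed them through the queue mechanism. Otherwise you end up creating a new pair of queue instances on every iteration, because they are lexically scoped.
Once you have finished queuing, issue a $queue->end to terminate the while loop.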
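Put together, the overall shape might look something like the following. This is only a sketch under assumptions: the URLs and the "parsing" are stand-ins, the sub names worker and printer are my own, and the result is printed to STDOUT rather than appended to your result file.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $worker_count    = 2;                     # hypothetical thread count
my $queue           = Thread::Queue->new;    # work items, created once
my $queue_processed = Thread::Queue->new;    # results, created once

sub worker {
    while ( defined( my $url = $queue->dequeue ) ) {
        # Placeholder for fetching and parsing the match page.
        $queue_processed->enqueue( { url => $url, length => length $url } );
    }
    return;
}

sub printer {
    my $printed = 0;
    while ( defined( my $match = $queue_processed->dequeue ) ) {
        print join( "\t", $match->{url}, $match->{length} ), "\n";
        $printed++;
    }
    return $printed;
}

my $printer = threads->create( \&printer );
my @workers;
push @workers, threads->create( \&worker ) for 1 .. $worker_count;

# Stand-in for the loop over championships: it only feeds the queue.
for my $championship ( 'a', 'b' ) {
    $queue->enqueue("http://example.invalid/$championship/match$_") for 1 .. 3;
}

$queue->end;              # workers exit once the queue is drained
$_->join for @workers;
$queue_processed->end;    # all results are in, so the printer can exit
my $printed = $printer->join;
```

Ending the result queue only after the workers have been joined means the printer no longer needs to be told the number of matches in advance.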
You also don't need to pass a $thread_id to a thread, because it already has one. Try threads->self->tid(); instead.
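For example, a small sketch in which each worker looks up its own ID (the sub name worker is just for illustration):

```perl
use strict;
use warnings;
use threads;

sub worker {
    # Every thread can ask for its own ID; nothing has to be passed in.
    my $tid = threads->self->tid();
    return $tid;
}

my @workers;
push @workers, threads->create( \&worker ) for 1 .. 2;

my @tids = map { $_->join } @workers;
print "worker tids: @tids\n";
```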