Perl WWW::Mechanize / HTML::TokeParser: following and storing URLs from href attributes

Time: 2012-02-21 16:17:10

Tags: perl web-scraping perl-module www-mechanize

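Thanks to the help on this site I've made some progress in Perl, but I've run into a problem. One of the pages I'm scraping has changed, and I can't figure out how to get what I need from it any more. What I want to do is store the link to each page I want to visit. The problem is that these links sit inside href attributes in the page source, and I don't know how to extract them. Can anyone help?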

The links I need are on lines 316 to 354 of the source of this page: http://www.soccerbase.com/teams/home.sd

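Basically, I need to extract the links into a variable so I can use them in my other scripts. As mentioned, I'm using WWW::Mechanize and HTML::TokeParser and was hoping to do it with those, but I can't work it out at the moment. Thanks in advance!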

1 answer:

Answer 0 (score: 0)

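Use the find_all_links method in WWW::Mechanize. There is no need to bother with a parser by hand. You might want to loosen the regular expression so that you get all ~1000 teams at once.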

use WWW::Mechanize qw();
my $w = WWW::Mechanize->new;
$w->get('http://www.soccerbase.com/teams/home.sd');
for my $link ($w->find_all_links(url_regex => qr/comp_id=1\b/)) {
    # each $link is a WWW::Mechanize::Link object
    # (21 links match here: the league page plus 20 teams)
    printf "URL=%s\tTeam=%s\n", $link->url_abs, $link->text;
}

URL=http://www.soccerbase.com/tournaments/tournament.sd?comp_id=1       Team=Premier League
URL=http://www.soccerbase.com/teams/team.sd?team_id=142&comp_id=1       Team=Arsenal
URL=http://www.soccerbase.com/teams/team.sd?team_id=154&comp_id=1       Team=Aston Villa
URL=http://www.soccerbase.com/teams/team.sd?team_id=308&comp_id=1       Team=Blackburn
URL=http://www.soccerbase.com/teams/team.sd?team_id=354&comp_id=1       Team=Bolton
URL=http://www.soccerbase.com/teams/team.sd?team_id=536&comp_id=1       Team=Chelsea
URL=http://www.soccerbase.com/teams/team.sd?team_id=942&comp_id=1       Team=Everton
URL=http://www.soccerbase.com/teams/team.sd?team_id=1055&comp_id=1      Team=Fulham
URL=http://www.soccerbase.com/teams/team.sd?team_id=1563&comp_id=1      Team=Liverpool
URL=http://www.soccerbase.com/teams/team.sd?team_id=1718&comp_id=1      Team=Man City
URL=http://www.soccerbase.com/teams/team.sd?team_id=1724&comp_id=1      Team=Man Utd
URL=http://www.soccerbase.com/teams/team.sd?team_id=1823&comp_id=1      Team=Newcastle
URL=http://www.soccerbase.com/teams/team.sd?team_id=1855&comp_id=1      Team=Norwich
URL=http://www.soccerbase.com/teams/team.sd?team_id=2093&comp_id=1      Team=QPR
URL=http://www.soccerbase.com/teams/team.sd?team_id=2477&comp_id=1      Team=Stoke
URL=http://www.soccerbase.com/teams/team.sd?team_id=2493&comp_id=1      Team=Sunderland
URL=http://www.soccerbase.com/teams/team.sd?team_id=2513&comp_id=1      Team=Swansea
URL=http://www.soccerbase.com/teams/team.sd?team_id=2590&comp_id=1      Team=Tottenham
URL=http://www.soccerbase.com/teams/team.sd?team_id=2744&comp_id=1      Team=West Brom
URL=http://www.soccerbase.com/teams/team.sd?team_id=2783&comp_id=1      Team=Wigan
URL=http://www.soccerbase.com/teams/team.sd?team_id=2848&comp_id=1      Team=Wolves
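
Since the question asks about keeping the links in a variable and mentions HTML::TokeParser, here is a minimal sketch of that route for comparison. It is not taken from the answer above. The inline HTML fragment is made up for illustration (the real page's markup may well differ), and the team_links helper and the %url_for hash are hypothetical names. It reads the href from each `a` start tag, keeps only links containing team_id, and stores absolute URLs keyed by team name.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use URI;

# Assumed sample markup for illustration only; the real page may differ.
my $html = <<'HTML';
<ul>
  <li><a href="/teams/team.sd?team_id=142&amp;comp_id=1">Arsenal</a></li>
  <li><a href="/teams/team.sd?team_id=536&amp;comp_id=1">Chelsea</a></li>
  <li><a href="/about.sd">About</a></li>
</ul>
HTML

# Page the fragment is assumed to come from; used to resolve relative hrefs.
my $base = 'http://www.soccerbase.com/teams/home.sd';

# Hypothetical helper: returns a hash of team name => absolute URL.
sub team_links {
    my ($html, $base) = @_;
    my %url_for;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    # get_tag('a') skips ahead to the next <a> start tag;
    # $tag->[1] is the hash of its attributes.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        next unless defined $href && $href =~ /team_id=\d+/;
        # Link text up to the closing </a>, whitespace trimmed.
        my $name = $p->get_trimmed_text('/a');
        $url_for{$name} = URI->new_abs($href, $base)->as_string;
    }
    return %url_for;
}

my %url_for = team_links($html, $base);
printf "%s => %s\n", $_, $url_for{$_} for sort keys %url_for;
```

With find_all_links the same idea should apply: instead of printing inside the loop, assign $link->url_abs to a hash entry keyed by $link->text, then loop over the hash later and pass each URL to $w->get.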