Trying to figure out how to push a specific link, found on each page in a separate list of links, into an array

Time: 2014-06-04 19:35:19

Tags: arrays perl web-scraping web-crawler html-treebuilder


General idea


Here is a snippet of what I'm working with:

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@blarg_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'foo',
                class => 'bar'
        );
        foreach (@temp_stuff) {
                push(@collector, "http://www.foobar.sx" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
        };
};

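Hopefully it is evident that what I am hopelessly trying to do is push the link endings found on each page in a list of links into an array (@collector above). So the first link in @blarg_links, when visited, contains one or more foo tags with class bar. When as_HTML is applied to each of them, the regex captures the href value I want, and that value is then joined to the site's base URL to build a series of links that hold the data I'm really after... Does this make sense?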


Actual data


my $url2 = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $page2 = get( $url2 ) or die $!;
my $p2 = HTML::TreeBuilder->new_from_content( $page2 );

my @stuff2 = $p2->look_down(
        _tag => 'div',
        class => 'year mini-day-on'
);

my @chem_links;

foreach (@stuff2) {
        push(@chem_links, $1) if $_->as_HTML =~ m/(http:\/\/www\.chemistry\.ucla\.edu\/calendar-node-field-date\/day\/[0-9]{4}-[0-9]{2}-[0-9]{2})/;
};

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@chem_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'span',
                class => 'field-content'
        );
};

foreach (@temp_stuff) {
                push(@collector, "http://www.chemistry.ucla.edu" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
};

N.B. I want to use HTML::TreeBuilder. I am aware of the alternatives.


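2 answers:

Answer 0 (score: 1)

Here is a rough attempt at what I think you want.

It fetches all the links on the first page and visits each one in turn, printing the links found in each <span class="field-content"> element.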

use strict;
use warnings;
use 5.010;

use HTML::TreeBuilder;
use IO::Handle;    # makes STDOUT->autoflush available on Perl older than 5.14

STDOUT->autoflush;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $tree = HTML::TreeBuilder->new_from_url($url);

my @chem_links;

for my $div ( $tree->look_down( _tag => 'div', class => qr{\bmini-day-on\b} ) ) {
  my ($anchor) = $div->look_down(_tag => 'a', href => qr{http://www\.chemistry\.ucla\.edu});
  next unless $anchor;    # skip a div that holds no matching link
  push @chem_links, $anchor->attr('href');
};

my @collector;

for my $url (@chem_links) {

  say $url;

  my $tree = HTML::TreeBuilder->new_from_url($url);

  my @seminars;

  for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
    my ($anchor) = $span->look_down(_tag => 'a', href => qr{/});
    next unless $anchor;    # some spans may hold plain text and no link
    push @seminars, 'http://www.chemistry.ucla.edu'.$anchor->attr('href');
  }

  say "  $_" for @seminars;
  say '';

  push @collector, @seminars;
};
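
For comparison with the real-data code in the question: there, the second foreach sits after the closing brace of the loop over @chem_links, so @temp_stuff has already been overwritten on every pass and only the last page's matches reach @collector. The sketch below reproduces that difference offline. The %pages hash, its keys, and its hrefs are made up and merely stand in for the fetched pages.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetched pages: page name => hrefs found on it.
my %pages = (
    'day/1' => [ '/seminars/a' ],
    'day/2' => [ '/seminars/b', '/seminars/c' ],
);
my @day_links = ( 'day/1', 'day/2' );

# Inner loop placed AFTER the outer loop: only the last page survives.
my ( @found, @outside );
for my $day (@day_links) {
    @found = @{ $pages{$day} };    # overwritten on every iteration
}
push @outside, "http://www.chemistry.ucla.edu$_" for @found;

# Inner loop placed INSIDE the outer loop: every page contributes.
my @inside;
for my $day (@day_links) {
    my @found_here = @{ $pages{$day} };
    push @inside, "http://www.chemistry.ucla.edu$_" for @found_here;
}

print scalar(@outside), " link(s) collected outside the loop\n";
print scalar(@inside),  " link(s) collected inside the loop\n";
```

Declaring the per-page array with my inside the loop, as the script above does with @seminars, makes this kind of slip harder to write in the first place.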

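Answer 1 (score: 0)

For a more modern framework for parsing web pages, I'd suggest taking a look at Mojo::UserAgent and Mojo::DOM. Instead of manually walking each part of the HTML tree, you can use the power of CSS selectors to zero in on the specific data you want. There is a good 8-minute introductory video on the framework at Mojocast Episode 5.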
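
As a small offline illustration of the selector idea, the sketch below runs one CSS selector against a hand-written fragment. The markup is an assumption that only imitates the shape of the calendar page (the "mini-day-off" class in particular is invented for contrast); Mojo::DOM's new, find, map, and each are the only library calls used.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up markup that only imitates the shape of the calendar page.
my $html = <<'HTML';
<div class="year mini-day-on"><a href="/calendar-node-field-date/day/2014-01-06">6</a></div>
<div class="year mini-day-off"><a href="/calendar-node-field-date/day/2014-01-07">7</a></div>
<div class="year mini-day-on"><a href="/calendar-node-field-date/day/2014-01-09">9</a></div>
HTML

my $dom = Mojo::DOM->new($html);

# Child anchors of div.mini-day-on whose href contains "/day/".
my @hrefs = $dom->find('div.mini-day-on > a[href*="/day/"]')
                ->map( sub { $_->attr('href') } )
                ->each;

print "$_\n" for @hrefs;
```

Only the two anchors inside the mini-day-on divs are selected; the class selector matches one class name among several, so no regex on the class attribute is needed. The full script for the live calendar follows.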

# Parses the UCLA Chemistry Calendar and displays all seminar links

use strict;
use warnings;

use Mojo::UserAgent;
use URI;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;

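# map() with a callback works across Mojolicious versions; calling attr() directly
# on a collection relied on method forwarding that later releases dropped.
for my $dayhref ($dom->find('div.mini-day-on > a[href*="/day/"]')->map(sub { $_->attr('href') })->each) {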
    my $dayurl = URI->new($dayhref)->abs($url);
    print $dayurl, "\n";

    my $daydom = $ua->get($dayurl->as_string)->res->dom;
    # Same map() idiom as above: extract the href of every matching anchor.
    for my $seminarhref ($daydom->find('span.field-content > a[href]')->map(sub { $_->attr('href') })->each) {
        my $seminarurl = URI->new($seminarhref)->abs($dayurl);
        print "  $seminarurl\n";
    }

    print "\n";
}

The output is the same as that of Borodin's solution using HTML::TreeBuilder:

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06
  http://www.chemistry.ucla.edu/seminars/nano-rheology-enzymes

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-09
  http://www.chemistry.ucla.edu/seminars/imaging-approach-biology-disease-through-chemistry

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-10
  http://www.chemistry.ucla.edu/seminars/arginine-methylation-%E2%80%93-substrates-binders-function
  http://www.chemistry.ucla.edu/seminars/special-inorganic-chemistry-seminar

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-13
  http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0

...
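
A side note on the URI->new(...)->abs(...) calls in the script: they resolve each href against the page it came from, which avoids hard-coding the host as the string concatenation in the question does. The sketch below checks that step in isolation. The root-relative href is an assumption about how the site writes its links, inferred from the output listed here.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06';

# Root-relative href: only the scheme and host are taken from the base.
my $relative = URI->new('/seminars/nano-rheology-enzymes')->abs($base);

# Already-absolute href: returned unchanged.
my $absolute = URI->new('http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0')->abs($base);

print "$relative\n$absolute\n";
```

Because abs leaves absolute URLs alone, the same call is safe whether a page uses relative or absolute links.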