Associating adjacent elements with Perl's Web::Scraper

Date: 2014-02-21 04:54:55

Tags: perl web-scraping

Here is the example at hand:

#!/usr/bin/perl
use strict;
use Web::Scraper;
use Data::Dumper;

my $html = q[
<html>
  <body>
    <div class="mainContainer">
      <div class="when">February 20, 2014</div>
      <div class="name">Name 1</div>
      <div class="desc">Desc 1</div>
      <div class="when">February 21, 2014</div>
      <div class="name">Name 2</div>
      <div class="desc">Desc 2</div>
      <div class="name">Name 3</div>
      <div class="desc">Desc 3</div>
      <div class="when">February 22, 2014</div>
      <div class="name">Name 4</div>
      <div class="desc">Desc 4</div>
    </div>
  </body>
</html>
];

my $scraper = scraper {
    process ".when", "events[]" => scraper {
      my $when = $_->content();
      my $hash = {};
      $hash->{$when}->{name} = "NAME";
      $hash->{$when}->{desc} = "DESC";
      return $hash;
    };
};

my $result = $scraper->scrape($html);

print Dumper( $result );

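What I am trying to do is associate each date with its event details. As you can see, the divs are not nested, so it is not that straightforward (at least for me). In addition, each event consists of a name and a desc. I have not found a way to use CSS selectors to associate the adjacent elements in the structure I need. I think I need a custom subroutine that returns the associated elements. What I want to retrieve looks something like this: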

[
 'February 20, 2014' => [
     {
     'name' => 'Name 1',
     'desc' => 'Desc 1'
     }
 ],
 'February 21, 2014' => [
     {
     'name' => 'Name 2',
     'desc' => 'Desc 2'
     },
     {
     'name' => 'Name 3',
     'desc' => 'Desc 3'
     }
 ],
 'February 22, 2014' => [
     {
     'name' => 'Name 4',
     'desc' => 'Desc 4'
     }
 ]
]

1 answer:

Answer 0 (score: 0)

It might be better to fetch the data first and then process it after scraping. So...:

my $scraper = scraper {
  process ".when", "dates[]" => "TEXT";
  process ".name", "names[]" => "TEXT";
  process ".desc", "desc[]" => "TEXT";
};

my $result = $scraper->scrape($html);

# Post-process the scraped lists. Pairing by index assumes each date is
# followed by exactly one name and one description.

my @dates = @{ $result->{dates} };
my @names = @{ $result->{names} };
my @info = @{ $result->{desc} };
my %events;

for my $i ( 0 .. $#dates ) {
  # push autovivifies the array ref the first time a date is seen,
  # so no separate "exists" branch is needed
  push @{ $events{ $dates[$i] } }, { 'name' => $names[$i], 'desc' => $info[$i] };
}

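%events will then hold the data you need. This assumes you still need this, and that each event date is followed by exactly one name and one description. Also, I have not tested this.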
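
Note that the one-name-per-date assumption does not hold for the sample HTML in the question, where February 21, 2014 has two events, so pairing by index would attach names to the wrong dates. A possible alternative, sketched below, is to collect every child div of .mainContainer in document order and remember the most recent date while walking the list. This is only a sketch: group_events and the "items" key are hypothetical names introduced for illustration, the shortened HTML stands in for the question's markup, and it relies on Web::Scraper passing the matched HTML::Element to a code reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Hypothetical helper: fold a flat, document-ordered list of
# { class => ..., text => ... } items into date => [ { name, desc }, ... ].
sub group_events {
    my ($items) = @_;
    my ( %events, $date );
    for my $item (@$items) {
        my ( $class, $text ) = @{$item}{qw(class text)};
        next unless defined $class;
        if ( $class eq 'when' ) {
            $date = $text;             # remember the current date
            $events{$date} ||= [];
        }
        elsif ( !defined $date ) {
            next;                      # skip anything before the first date
        }
        elsif ( $class eq 'name' ) {
            push @{ $events{$date} }, { name => $text };    # start a new event
        }
        elsif ( $class eq 'desc' && @{ $events{$date} } ) {
            $events{$date}[-1]{desc} = $text;    # attach to the latest event
        }
    }
    return \%events;
}

my $html = q[
<div class="mainContainer">
  <div class="when">February 20, 2014</div>
  <div class="name">Name 1</div>
  <div class="desc">Desc 1</div>
  <div class="when">February 21, 2014</div>
  <div class="name">Name 2</div>
  <div class="desc">Desc 2</div>
  <div class="name">Name 3</div>
  <div class="desc">Desc 3</div>
</div>
];

# Grab each direct child div with its class and text, in document order.
my $scraper = scraper {
    process ".mainContainer > div", "items[]" => sub {
        my $el = shift;    # HTML::Element for the matched div
        return { class => $el->attr('class'), text => $el->as_text };
    };
};

my $items  = $scraper->scrape($html)->{items} || [];
my $events = group_events($items);

$Data::Dumper::Sortkeys = 1;
print Dumper($events);
```

The result is a hash reference keyed by date rather than the array shown in the question; since a hash is unordered, the original date order would have to be kept separately (for example in an array of dates) if it matters.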