How to parse a tricky HTML file with HTML::TreeBuilder

Time: 2014-08-15 11:10:49

Tags: perl html-parsing

Suppose we have the following HTML file:

test.htm

<!DOCTYPE html>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <b>weight:</b> 120kg<br>
    <b>length:</b> 10cm<br>
  </body>
</html>

How can I extract the following data from it?

{
'weight' => '120kg',
'length' => '10cm',
}

parser.pl

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file('test.htm');

#what to do here?

$root->delete( );

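2 answers:

Answer 0: (score: 3)

This gets you very close to what you want. You will still need to tidy the text strings you get for the keys and values: the keys keep their trailing colon and the values keep their leading space.

That said, I think you will find it much simpler to use a tool like Web::Scraper.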

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;

use Data::Dumper;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);

my $data;

foreach my $elem ($root->find('b')) {
  $data->{($elem->content_list)[0]} = $elem->right;
}

say Dumper $data;

__END__
<!DOCTYPE html>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <b>weight:</b> 120kg<br>
    <b>length:</b> 10cm<br>
  </body>
</html>

Output:

$VAR1 = {
          'length:' => ' 10cm',
          'weight:' => ' 120kg'
        };
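
The tidying mentioned above might look like the following minimal sketch. It uses core Perl only and assumes every key ends in a single colon and every value is a plain text string. The `clean_pairs` helper is a hypothetical name introduced here for illustration and is not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the raw key/value pairs produced by the loop
# above and returns a cleaned copy (whitespace trimmed, trailing colon
# removed from the keys).
sub clean_pairs {
    my (%raw) = @_;
    my %clean;
    while (my ($key, $value) = each %raw) {
        # Trim leading and trailing whitespace on both strings.
        for ($key, $value) {
            s/^\s+//;
            s/\s+\z//;
        }
        # Drop the colon that terminates the label.
        $key =~ s/\s*:\z//;
        $clean{$key} = $value;
    }
    return %clean;
}

my %data = clean_pairs('length:' => ' 10cm', 'weight:' => ' 120kg');
printf "%s => %s\n", $_, $data{$_} for sort keys %data;
# prints:
# length => 10cm
# weight => 120kg
```

In the script above you would call it as `clean_pairs(%$data)` after the `foreach` loop. If a `<b>` element can be followed directly by another element rather than by text, `$elem->right` returns an object instead of a string, so that case would need a `ref` check first.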

Answer 1: (score: 1)

Two solutions using Mojo::DOM:

use strict;
use warnings;

use Mojo::DOM;
use Data::Dump;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

my %hash = do {
    my $text = $dom->find('body')->all_text();
    split ' ', $text;
};

dd \%hash;

my %hash2 = map {
    $_->all_text() => $_->next_sibling() =~ s{^\s+|\s+$}{}gr
} $dom->find('b')->each;

dd \%hash2;

__DATA__
<!DOCTYPE html>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <b>weight:</b> 120kg<br>
    <b>length:</b> 10cm<br>
  </body>
</html>

Output:

{ "length:" => "10cm", "weight:" => "120kg" }
{ "length:" => "10cm", "weight:" => "120kg" }
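
The code above was written against a 2014 release of Mojolicious. As far as I know, later releases renamed `next_sibling` to `next_node` and stopped forwarding method calls such as `all_text` through a collection, so treat the following as a sketch under that assumption and check the Mojo::DOM documentation for your installed version. It also strips the trailing colon from the keys, which the original output keeps.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $html = <<'HTML';
<!DOCTYPE html>
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <b>weight:</b> 120kg<br>
    <b>length:</b> 10cm<br>
  </body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

my %data;
for my $label ($dom->find('b')->each) {
    # Assumption: next_node is the current name for the old next_sibling.
    my $node = $label->next_node or next;
    next unless $node->type eq 'text';   # only accept a text node as the value

    # Key: the label text without its trailing colon.
    (my $key = $label->text) =~ s/:\s*\z//;
    # Value: the text node's content with surrounding whitespace trimmed.
    (my $value = $node->content) =~ s/^\s+|\s+$//g;

    $data{$key} = $value;
}

printf "%s => %s\n", $_, $data{$_} for sort keys %data;
```

Note that the first solution in the answer (splitting all the body text on whitespace) only works while neither the labels nor the values contain spaces. Pairing each `<b>` with the node that follows it does not have that limitation.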