假设我们有以下HTML文件:
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
如何从中获取以下数据?
{
'weight' => '120kg',
'length' => '10cm',
}
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('test.htm');
#what to do here?
$root->delete( );
答案 0 :(得分:3)
这会让你非常接近你想要的东西(你需要调整你为键和值稍微获得的文本字符串)。
但我认为您会发现使用像Web:Scraper这样的工具要简单得多。
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);
my $data;
foreach my $elem ($root->find('b')) {
$data->{($elem->content_list)[0]} = $elem->right;
}
say Dumper $data;
__END__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
输出:
$VAR1 = {
'length:' => ' 10cm',
'weight:' => ' 120kg'
};
答案 1 :(得分:1)
使用Mojo::DOM
的两个解决方案:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
my %hash = do {
my $text = $dom->find('body')->all_text();
split ' ', $text;
};
dd \%hash;
my %hash2 = map {
$_->all_text() => $_->next_sibling() =~ s{^\s+|\s+$}{}gr
} $dom->find('b')->each;
dd \%hash2;
__DATA__
<!DOCTYPE html>
<html>
<head>
<title>test</title>
</head>
<body>
<b>weight:</b> 120kg<br>
<b>length:</b> 10cm<br>
</body>
</html>
输出:
{ "length:" => "10cm", "weight:" => "120kg" }
{ "length:" => "10cm", "weight:" => "120kg" }