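So far I have been using Perl with HTML::TreeBuilder to get data from web pages. That works fine when the data is contained in meta or div tags, but now I have stumbled upon a new structure that I don't know how to scrape, even though it looks trivial.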
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
The example shows the relevant part of what I get back from the web. I want to get the values of units and horsePower.
The code snippet I have been using so far:
use strict;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder;
[...]
$reply = $ua->get($url, @ns_headers);
# printing the reply would get us the first code snippet.
print $reply->content;
unless ($reply->is_success) {
[...]
}
my $tree = HTML::TreeBuilder->new_from_content($reply->content);
my @unit_array = $tree -> look_down(_tag=>'meta','itemprop'=>'unit');
my $unit = $unit_array[0]->attr('content');
[...]
Does anyone know how to get at the relevant data, and should I use something other than HTML::TreeBuilder for this? Searching Stack Overflow and the web did not turn up a similar case.
Answer 0 (score: 1)
You are basically on the right track, but HTML::TreeBuilder knows nothing about JavaScript.
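Approach:
- find the <script> nodes
- extract the <script> content (this needs a bit more trickery, see the notes in the code)
- cut the JSON object out of the JavaScript with a regex
- decode the JSON into a Perl data structure
Note: the regex does not need the backslash in \; but without it the SO syntax highlighting gets confused.
A first rough solution without error checking. I have left some debugging lines in the code, commented out, so that you can follow what each step is doing: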
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;
use JSON;
my $decoder = JSON->new;
my $tree = HTML::TreeBuilder->new_from_file(\*DATA);
#$tree->dump;
my @scripts = $tree->look_down(_tag => 'script');
#$scripts[0]->dump;
# NOTE 1: ->as_text() *DOES NOT* return <script> content!
# NOTE 2: ->as_HTML() probably doesn't work for all cases, i.e. escaping
my $javascript = ($scripts[0]->content_list())[0];
#print "${javascript}\n";
my($json) = $javascript =~ /(\{.+\})\;/s;
#print "${json}\n";
my $object = $decoder->decode($json);
print Dumper($object);
print "FOUND: units: ", $object->{data}->{units},
" horsepower: ", $object->{data}->{horsePower}, "\n";
# IMPORTANT: $tree needs to be destroyed by hand when you're done with it!
$tree->delete;
exit 0;
__DATA__
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
Test run:
$ perl dummy.pl
$VAR1 = {
'data' => {
'horsePower' => '100',
'units' => 'kW'
}
};
FOUND: units: kW horsepower: 100
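Since the solution above skips error checking, here is a minimal sketch of how the same steps might be wrapped with some. It is not part of the original answer: the helper name extract_bootstrap_data is hypothetical, JSON::PP is swapped in only because it ships with Perl, and the regex assumes that the object assigned to bootstrapData is strict JSON followed by a semicolon.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use JSON::PP;    # core module; the JSON module used above should work the same way

# Hypothetical helper (not from the original answer): returns the decoded
# bootstrapData structure, or undef if no usable <script> block is found.
sub extract_bootstrap_data {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $result;

    for my $script ($tree->look_down(_tag => 'script')) {
        # <script> content is stored as plain-text children of the node
        my $javascript = join '', grep { !ref } $script->content_list;

        # Assumption: the object literal follows "bootstrapData =" and is strict JSON
        my ($json) = $javascript =~ /bootstrapData\s*=\s*(\{.*\})\s*;/s
            or next;

        # decode() dies on malformed input, so trap the exception
        $result = eval { JSON::PP->new->decode($json) };
        warn "JSON decode failed: $@" unless defined $result;
        last if defined $result;
    }

    $tree->delete;    # free the tree explicitly, as in the code above
    return $result;
}

# Same sample page as in the question
my $html = <<'HTML';
<html lang="en">
<body>
<script type="text/javascript">
panel.web.bootstrapData = {
"data": {
"units": "kW",
"horsePower": 100.00
}
};
</script>
</body>
</html>
HTML

my $data = extract_bootstrap_data($html);
if ($data) {
    print "FOUND: units: $data->{data}{units} horsepower: $data->{data}{horsePower}\n";
} else {
    print "no bootstrapData found\n";
}
```

With the code from the question, the page content would be passed in instead of the sample, for example extract_bootstrap_data($reply->decoded_content) after checking $reply->is_success. If the page ever embeds JavaScript that is not valid JSON (single quotes, unquoted keys, trailing commas), the decode step will fail and the helper returns undef.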