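Folks,
There is so much information out there on HTML::TreeBuilder that I'm surprised I can't find the answer to this; hopefully I'm not just missing it.
What I'm trying to do is simply parse between parent nodes, so given an HTML document like this: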
<html>
<body>
<a id="111" name="111"></a>
<p>something</p>
<p>something</p>
<p>something</p>
<a href="xxx">something</a>
<a id="222" name="222"></a>
<p>something</p>
<p>something</p>
<p>something</p>
....
</body>
</html>
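I want to be able to get information about the first anchor tag (111), then process the 3 p tags, then get the next anchor tag (222), then process those p tags, and so on.
Finding each anchor tag is easy: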
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("index-01.htm");
foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
    if ($atag->attr('id')) {
        # Found 'a' tag, now process the p tags until the next 'a'
    }
}
But once I've found that tag, how do I get all the p tags up to the next anchor?
TIA !!
Answer 0 (score: 4)
#!/usr/bin/perl
use strict; use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file(\*DATA);
$tree->elementify;
# Turn text nodes into '~text' pseudo-elements so every sibling is an object
$tree->objectify_text;

foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
    if ($atag->attr('id')) {
        printf "Found %s\n", $atag->as_XML;
        process_p( $atag );
    }
}

# Walk the right-hand siblings of $tag, printing each <p> until the next <a>
sub process_p {
    my ($tag) = @_;
    while ( defined( $tag ) and defined( my $next = $tag->right ) ) {
        last if lc $next->tag eq 'a';
        if ( lc $next->tag eq 'p') {
            # Restore plain text inside this <p> before extracting it
            $next->deobjectify_text;
            print $next->as_text, "\n";
        }
        $tag = $next;
    }
}
__DATA__
<html>
<body>
<a id="111" name="111"></a>
<p>something</p>
<p>something</p>
<p>something</p>sometext
<a href="xxx">something</a>
<a id="222" name="222"></a>
<p>something</p>
<p>something</p>
<p>something</p>
</body>
</html>
Output:
Found <a id="111" name="111"></a>
something
something
something
Found <a id="222" name="222"></a>
something
something
something
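If the paragraphs need to be kept per anchor rather than printed as they are found, the same sibling walk can fill a hash instead. The following is only a sketch under a few assumptions: `paragraphs_by_anchor` is a hypothetical helper name, the sample HTML is made up for illustration, and the target `<p>` elements are assumed to be direct siblings of the anchors. It calls `right` in list context and skips bare text siblings with a `ref` check instead of calling `objectify_text`.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: maps each anchor id to the texts of the <p>
# elements that follow it, stopping at the next <a> of any kind.
sub paragraphs_by_anchor {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($html);

    my %group;
    # Only anchors whose id attribute is non-empty
    for my $anchor ( $tree->look_down( _tag => 'a', id => qr/./ ) ) {
        my @texts;
        # In list context, right() returns every following sibling;
        # plain text siblings come back as non-reference strings.
        for my $sib ( $anchor->right ) {
            next unless ref $sib;        # skip bare text nodes
            last if $sib->tag eq 'a';    # next anchor ends this group
            push @texts, $sib->as_text if $sib->tag eq 'p';
        }
        $group{ $anchor->attr('id') } = \@texts;
    }
    $tree->delete;
    return \%group;
}

# Made-up sample input, similar in shape to the document in the question
my $html = <<'HTML';
<html><body>
<a id="111" name="111"></a>
<p>first</p>
<p>second</p>stray text
<a id="222" name="222"></a>
<p>third</p>
</body></html>
HTML

my $groups = paragraphs_by_anchor($html);
for my $id ( sort keys %$groups ) {
    print "$id: ", join( ', ', @{ $groups->{$id} } ), "\n";
}
```

As in the script above, an anchor without an id still ends the current group, so any p tags after it are ignored until the next anchor that has an id.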
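Alternatively, using HTML::TokeParser::Simple (with the same __DATA__ section as the first script):
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $tag = $parser->get_tag('a') ) {
    next unless $tag->get_attr('id');
    printf "Found %s\n", $tag->as_is;
    process_p($parser);
}

# Print the text of each <p> until the next <a> start tag is seen
sub process_p {
    my ($parser) = @_;
    while ( my $next = $parser->get_token ) {
        if ( $next->is_start_tag('a') ) {
            # Put the <a> back so the main loop can examine it
            $parser->unget_token($next);
            return;
        }
        elsif ( $next->is_start_tag('p') ) {
            print $parser->get_text('/p'), "\n";
        }
    }
    return;
}
__DATA__
<html>
<body>
<a id="111" name="111"></a>
<p>something</p>
<p>something</p>
<p>something</p>sometext
<a href="xxx">something</a>
<a id="222" name="222"></a>
<p>something</p>
<p>something</p>
<p>something</p>
</body>
</html>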
Output:
Found <a id="111" name="111">
something
something
something
Found <a id="222" name="222">
something
something
something
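The token-based approach can also collect its results instead of printing them, which keeps the anchors in document order. This is a rough sketch, not a drop-in replacement: `sections_from_tokens` and the `id`/`paragraphs` keys are hypothetical names, and the input is passed as a string purely for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: returns an array ref of { id => ..., paragraphs => [...] }
# records, one per <a> that carries an id, in document order.
sub sections_from_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new( string => $html );

    my ( @sections, $current );
    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag('a') ) {
            my $id = $token->get_attr('id');
            # An <a> with an id opens a new section; any other <a>
            # closes the current one, as in the scripts above.
            if ( defined($id) && length($id) ) {
                $current = { id => $id, paragraphs => [] };
                push @sections, $current;
            }
            else {
                $current = undef;
            }
        }
        elsif ( $current && $token->is_start_tag('p') ) {
            # Read the text up to the matching </p>
            push @{ $current->{paragraphs} }, $parser->get_text('/p');
        }
    }
    return \@sections;
}

# Made-up sample input for illustration
my $html = '<a id="111"></a><p>first</p><p>second</p>'
         . '<a id="222"></a><p>third</p>';

for my $section ( @{ sections_from_tokens($html) } ) {
    print "$section->{id}: ", join( ', ', @{ $section->{paragraphs} } ), "\n";
}
```

One difference from the tree-based version worth keeping in mind: the token stream is flat, so this picks up p tags at any nesting depth after the anchor, not only its direct siblings.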