HTML parsing: getting content from an inner tag

Date: 2015-06-30 18:15:57

Tags: html perl html-parsing

Test input file:

# cat test.html 
<div>line 1<div>Another 1</div></div>
<div>line 2<div>Another 2</div></div>
<div>line 3<div>Another 3</div></div>

Expected output:

Another 1
Another 2
Another 3

Script:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;

# $tree->ignore_ignorable_whitespace(0);
# $tree->no_space_compacting(1);

$tree->parse_file("test.html");

foreach my $a ($tree->find("div")) 
{
  print $a->as_text."\n";
}

Script output:

line 1Another 1
Another 1
line 2Another 2
Another 2
line 3Another 3
Another 3

Problem: I'm looking for help extracting the content of only the inner div tags. My script prints line 1Another 1 first and then Another 1, but I'm only interested in Another 1.

I tried ignore_ignorable_whitespace and no_space_compacting (as shown in the script comments), but neither helped. Either I'm not using them correctly, or I'm barking up the wrong tree.

1 answer:

Answer 0 (score: 1)

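You are finding all the div elements when you want just the inner ones, and as_text on an outer div also returns the text of everything nested inside it. The findnodes method, which comes from HTML::TreeBuilder::XPath rather than plain HTML::TreeBuilder, takes an XPath expression, so once the tree is built with that class you can write

print $_->as_text, "\n" for $tree->findnodes('//div/div');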
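
For completeness, here is a minimal sketch of the whole script, assuming the HTML::TreeBuilder::XPath module from CPAN is installed. It parses the test input from an inline string instead of test.html (an assumption made to keep the example self-contained) and selects every div that is a direct child of another div. The look_down variant is an alternative that does not rely on XPath; the variable names @inner and @inner_ld are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;   # CPAN module; a subclass of HTML::TreeBuilder

# Same content as test.html in the question, inlined for the example.
my $html = <<'HTML';
<div>line 1<div>Another 1</div></div>
<div>line 2<div>Another 2</div></div>
<div>line 3<div>Another 3</div></div>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;                     # signal end of input to the parser

# XPath: every div whose parent is also a div.
my @inner = map { $_->as_text } $tree->findnodes('//div/div');

# Same selection without XPath, using HTML::Element's look_down:
# match div elements whose parent element is itself a div.
my @inner_ld = map { $_->as_text } $tree->look_down(
    _tag => 'div',
    sub { my $p = $_[0]->parent; $p && $p->tag eq 'div' },
);

print "$_\n" for @inner;

$tree->delete;                  # free the tree explicitly
```

With the test input above, this is intended to print only the three Another lines. If you prefer to stay with plain HTML::TreeBuilder, the look_down call should work unchanged on a tree built with HTML::TreeBuilder->new.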