Extracting HTML between tags using Perl

Asked: 2013-05-28 18:06:10

Tags: html perl html-parsing

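I want to extract all the HTML between a pair of tags from a string or a file. I have been using Perl with the HTML::Parser module. I thought this would be a simple task, but it has turned out to be quite tricky. I found some code that works, but I don't know how to save the result to a string instead of printing it. Any help is appreciated, or if you could show me some code that does this with HTML::TokeParser or a similar approach.

Thanks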

use strict;
use warnings;
use HTML::Parser ();

my $content = <<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>Some title goes here</title>
 </head>
 <body bgcolor="#FFFFFF">
   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>
 </body>
</html>
EOF


my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;

sub start_handler {
    my ( $self, $tagname, $attr ) = @_;
    return unless $tagname eq 'body';

    # Inside <body>: print the raw text of every start tag and text node.
    $self->handler( start => sub { print shift }, "text" );
    $self->handler( text  => sub { print shift }, "text" );
    $self->handler( end   => sub {
        my ( $endtagname, $self, $text ) = @_;
        if ( $endtagname eq $tagname ) {
            $self->eof;    # stop parsing at </body>
        }
        else {
            print $text;
        }
    }, "tagname,self,text" );
}

If I change the start, text, and end handlers in the subroutine above as follows, why is the text not saved in my variable?

$self->handler( start => sub {  my ($text) = @_; $inner_body = $inner_body. $text; }, "text" );
$self->handler( text =>  sub {  my ($text) = @_; $inner_body = $inner_body. $text; }, "text" );
$self->handler( end  => sub {
       my ($endtagname, $self, $text) = @_;
       if($endtagname eq $tagname) {
            $self->eof;
           } else {
             $inner_body = $inner_body. $text;
           }
        }, "tagname,self,text");

}

print $inner_body;  # <- prints blank???

Desired output, to be saved in a variable:


   <div id="leftcol">
     menu column
  </div>
  <div id="body">
   <p>some text goes here some text goes here<br />
    some text goes here some text goes here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
    <p><img src="img.gif" /> image here</p>
   <p><strong>some header</strong></p>
   <p>some text goes here some text goes here<br />
   some text goes here some text goes here</p>
   </div>
    <div id="rightcol">
   news column
    </div>

1 Answer:

Answer 0 (score: 1)

All you have to do is replace

print ...;

with

$inner_body .= ...;
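For illustration, here is a minimal sketch of that substitution applied to the handlers from the question. It assumes `$inner_body` is declared at file scope, before the handlers are installed, so that the closures and the final `print` all refer to the same variable. The short `$content` below is a hypothetical stand-in for the full page.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical stand-in document, shortened from the question's page.
my $content = <<'EOF';
<html>
 <head><title>Some title goes here</title></head>
 <body bgcolor="#FFFFFF">
  <div id="leftcol">menu column</div>
  <p>some text goes here<br />some text goes here</p>
 </body>
</html>
EOF

# File-scoped accumulator shared by the handlers and the final print.
my $inner_body = '';

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, 'self,tagname' );
$p->parse($content);

print $inner_body;

sub start_handler {
    my ( $self, $tagname ) = @_;
    return unless $tagname eq 'body';

    # Once <body> has been seen, append the raw source text of each event.
    my $append = sub { $inner_body .= shift };
    $self->handler( start => $append, 'text' );
    $self->handler( text  => $append, 'text' );
    $self->handler( end   => sub {
        my ( $endtagname, $parser, $text ) = @_;
        if ( $endtagname eq 'body' ) {
            $parser->eof;    # stop parsing at </body>
        }
        else {
            $inner_body .= $text;
        }
    }, 'tagname,self,text' );
}
```

Note that the `print` comes after `$p->parse(...)` and that there is no `exit` before it. If the variable is declared inside the subroutine, or the script exits before reaching the `print`, nothing is printed.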

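Personally, I would use XML::LibXML. It can handle both HTML and XML (by calling the appropriate parser method). What you have is XHTML, which is XML-compatible, so we use parse_string instead of parse_html_string.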

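use strict;
use warnings;

use XML::LibXML               qw( );
use XML::LibXML::XPathContext qw( );

# The document declares the XHTML namespace, so XPath needs a prefix for it.
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(h => 'http://www.w3.org/1999/xhtml');

# $content is the XHTML string from the question.
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($content);
my ($body_node) = $xpc->findnodes('/h:html/h:body', $doc)
   or die "No <body> element found";

# Serialize each child of <body> and join them into one string.
my $inner_body = join '', map $_->toString(), $body_node->childNodes();
print $inner_body;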
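Since the question also asks about HTML::TokeParser, here is a rough sketch of the same extraction with that module. It is not part of the original answer. It assumes the token layouts described in the HTML::TokeParser documentation, where the original source text sits at a different index for each token type, and it uses a shortened, hypothetical `$content`.

```perl
use strict;
use warnings;
use HTML::TokeParser ();

# Hypothetical stand-in document, shortened from the question's page.
my $content = <<'EOF';
<html>
 <head><title>Some title goes here</title></head>
 <body bgcolor="#FFFFFF">
  <div id="leftcol">menu column</div>
  <p>some text goes here<br />some text goes here</p>
 </body>
</html>
EOF

# A reference to a scalar is treated as the literal document to parse.
my $p = HTML::TokeParser->new( \$content )
    or die "Cannot create parser";

my $inner_body = '';

# Skip ahead to the opening <body> tag, then collect tokens until </body>.
if ( $p->get_tag('body') ) {
    while ( my $token = $p->get_token ) {
        my $type = $token->[0];
        last if $type eq 'E' && $token->[1] eq 'body';

        # Pick the element that holds the original source text.
        $inner_body .=
              $type eq 'S'  ? $token->[4]    # start tag
            : $type eq 'E'  ? $token->[2]    # end tag
            : $type eq 'PI' ? $token->[2]    # processing instruction
            :                 $token->[1];   # text, comment, declaration
    }
}

print $inner_body;
```

Because this pulls tokens in a loop rather than registering callbacks, the accumulator is an ordinary lexical in the same scope as the `print`, which avoids the variable-scope pitfall from the question.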