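I want to extract all the HTML between a pair of tags from a string or a file. I have been using Perl with the HTML::Parser module. I thought this would be a simple task, but it has turned out to be quite tricky. I found some code that works, but I don't know how to save the result to a string. Any help is appreciated, or perhaps you could show me some code that does this with HTML::TokeParser or a similar approach.
Thanks.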
use HTML::Parser;
my $content=<<EOF;
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Some title goes here</title>
</head>
<body bgcolor="#FFFFFF">
<div id="leftcol">
menu column
</div>
<div id="body">
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><img src="img.gif" /> image here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
</div>
<div id="rightcol">
news column
</div>
</body>
</html>
EOF
my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, "self,tagname,attr" );
$p->parse($content);
exit;
sub start_handler {
my $self = shift;
my $tagname = shift;
my $attr = shift;
my $text = shift;
return unless ( $tagname eq 'body' );
$self->handler( start => sub { print shift }, "text" );
$self->handler( text => sub { print shift }, "text" );
$self->handler( end => sub {
my ($endtagname, $self, $text) = @_;
if($endtagname eq $tagname) {
$self->eof;
} else {
print $text;
}
}, "tagname,self,text");
}
If I modify the start, text, and end handlers in the subroutine above as follows:
$self->handler( start => sub { my ($text) = @_; $inner_body = $inner_body. $text; }, "text" );
$self->handler( text => sub { my ($text) = @_; $inner_body = $inner_body. $text; }, "text" );
$self->handler( end => sub {
my ($endtagname, $self, $text) = @_;
if($endtagname eq $tagname) {
$self->eof;
} else {
$inner_body = $inner_body. $text;
}
}, "tagname,self,text");
}
The desired output, to be saved in a variable:
<div id="leftcol">
menu column
</div>
<div id="body">
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
<p><img src="img.gif" /> image here</p>
<p><strong>some header</strong></p>
<p>some text goes here some text goes here<br />
some text goes here some text goes here</p>
</div>
<div id="rightcol">
news column
</div>
Answer 0 (score: 1)
All you have to do is replace
print ...;
with
$inner_body .= ...;
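Applied to the handlers from the question, that substitution might look like the minimal sketch below. The sample input is a shortened, hypothetical version of the page in the question, and `$inner_body` is declared at file scope (an assumption, since the question does not show where it is declared) so that every handler can append to it.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Shortened, hypothetical sample input (not the full page from the question).
my $content = <<'EOF';
<html>
<head><title>Some title goes here</title></head>
<body bgcolor="#FFFFFF">
<div id="leftcol">
menu column
</div>
<p>some text goes here<br />
some text goes here</p>
</body>
</html>
EOF

# Collects the raw source of everything between <body> and </body>.
my $inner_body = '';

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( start => \&start_handler, 'self,tagname' );
$p->parse($content);

print $inner_body;

sub start_handler {
    my ($self, $tagname) = @_;
    return unless $tagname eq 'body';

    # From here on, append instead of print.
    my $append = sub { $inner_body .= shift };
    $self->handler( start => $append, 'text' );
    $self->handler( text  => $append, 'text' );
    $self->handler( end   => sub {
        my ($endtagname, $self, $text) = @_;
        if ($endtagname eq 'body') {
            $self->eof;              # stop parsing at </body>
        } else {
            $inner_body .= $text;
        }
    }, 'tagname,self,text' );
}
```

Because the handlers receive the original source text, tags such as `<br />` are copied into `$inner_body` unchanged. Comments and declarations inside the body are not captured here, since no handler is registered for those events.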
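Personally, I would use XML::LibXML. It can handle both HTML and XML (by using the appropriate method of the parser). What you have is XHTML (which is compatible with XML), so we use parse_string instead of parse_html_string.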
use XML::LibXML qw( );
use XML::LibXML::XPathContext qw( );
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(h => 'http://www.w3.org/1999/xhtml');
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($content);
my ($body_node) = $xpc->findnodes('/h:html/h:body', $doc)
or die;
my $inner_body = join '', map $_->toString(), $body_node->childNodes();
print $inner_body;
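For input that is plain HTML rather than XHTML, the parse_html_string variant could be sketched roughly as follows. The sample input is hypothetical, and the sketch assumes that elements parsed this way carry no namespace, so the XPath expression needs no registered prefix.

```perl
use strict;
use warnings;
use XML::LibXML ();

# Hypothetical non-XHTML input: no xmlns declaration, unclosed <br>.
my $html = <<'EOF';
<html>
<head><title>Some title goes here</title></head>
<body bgcolor="#FFFFFF">
<div id="leftcol">menu column</div>
<p>some text goes here<br>some text goes here</p>
</body>
</html>
EOF

my $parser = XML::LibXML->new();

# recover lets the parser tolerate sloppy markup; the suppress_* options
# keep libxml2's diagnostics off STDERR.
my $doc = $parser->parse_html_string(
    $html,
    { recover => 1, suppress_errors => 1, suppress_warnings => 1 },
);

# No namespace is involved here, so findnodes can be called on the document.
my ($body_node) = $doc->findnodes('/html/body')
    or die "no <body> element found";

# Serialize each child of <body> and join the pieces into one string.
my $inner_body = join '', map { $_->toString() } $body_node->childNodes();
print $inner_body;
```

Note that the result is re-serialized from the parsed tree, so it may differ from the original source in details such as how empty elements are written.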