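I need to scan an HTML fragment for certain strings in the text (not inside element attributes) and wrap each matched string in a <span></span>.
Here is an example attempt, followed by its output: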
use v5.10;
use Mojo::DOM;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not <a href="/Cool.html">here</a>.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
my $dom = Mojo::DOM->new($body);
foreach my $e ($dom->find('*')->each) {
    my $text = $e->text;
    say "e text is: $text ";
    if ($text =~ /Cool/) {
        (my $newtext = $text ) =~ s/Cool/<span class="fun">Cool<\/span>/g;
        $e->replace_content($newtext);
    }
}
say $dom->root;
Output:
e text is:
e text is: Boring Text:
e text is: Highlight Cool whenever we see it. but not. And here is more Cool.
e text is: here
e text is: sub Cool { print "Foo "; }
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not. And here is more <span class="fun">Cool</span>.</p>
</div>
Close, but what I really want to see is something like this:
<div>
<p>Boring Text:</p>
<p>Highlight <span class="fun">Cool</span> whenever we see it. but not <a href="/Cool.html">here</a>.
<code>
sub <span class="fun">Cool</span> {
print "Foo\n";
}
</code>
And here is more <span class="fun">Cool</span>.</p>
</div>
Any help or pointers would be greatly appreciated. Thanks, Todd
Answer 0 (score: 1)
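Here is a start using XML::Twig. One problem is the literal newline inside the <code> tag: I guess the parser cannot tell it apart from an ordinary one. Encoding it as a numeric character reference (such as &#10;) or using a CDATA section might help. Otherwise I don't know how to handle it.
Contents of script.pl: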
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $body = qq|
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not <a href="/Cool.html">here</a>.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
|;
XML::Twig::Elt::set_replaced_ents(q{});
my $elt = XML::Twig::Elt->new( 'span' => { class => 'fun' }, 'Cool' );
my $twig = XML::Twig->new( pretty_print => 'nice' )->parse( $body );
$twig->subs_text( 'Cool', $elt->sprint );
$twig->print;
Run it like this:
perl script.pl
It produces:
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span>
whenever we see it.
but not <a href="/Cool.html">here</a>.
<code>
sub <span class="fun">Cool</span>
{
print "Foo
";
}
</code>
And here is more <span class="fun">Cool</span>
.
</p>
</div>
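Answer 1 (score: 1)
Having looked into XML::Twig, I'm not so sure it is the right tool. It is surprising how awkward such a simple task turns out to be.
Here is a working program that uses HTML::TreeBuilder. Unfortunately it doesn't produce formatted output, so I have added some whitespace myself.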
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = HTML::TreeBuilder->new_from_content(<<__HTML__);
<div>
<p>Boring Text:</p>
<p>
Highlight Cool whenever we see it.
but not <a href="/Cool.html">here</a>.
<code>
sub Cool {
print "Foo\n";
}
</code>
And here is more Cool.
</p>
</div>
__HTML__
$html->objectify_text;
# Visit every text node (objectify_text turned them into '~text' elements)
for my $text_node ($html->look_down(_tag => '~text')) {
    my $text = $text_node->attr('text');
    if (my @replacement = process_text($text)) {
        my $old_node = $text_node->replace_with(@replacement);
        $old_node->delete;
    }
}
$html->deobjectify_text;
print $html->guts->as_XML;

# Split the text on the word and put a <span> element between the pieces
sub process_text {
    # Limit -1 keeps a trailing empty field, so a match at the very
    # end of the text is not silently dropped
    my @nodes = split /\bCool\b/, shift, -1;
    return unless @nodes > 1;
    my $span = HTML::Element->new('span', class => 'fun');
    $span->push_content('Cool');
    for (my $i = 1; $i < @nodes; $i += 2) {
        splice @nodes, $i, 0, $span->clone;
    }
    $span->delete;
    @nodes;
}
Output
<div>
<p>Boring Text:</p>
<p>
Highlight <span class="fun">Cool</span> whenever we see it.
but not <a href="/Cool.html">here</a>.
<code> sub <span class="fun">Cool</span> { print "Foo "; } </code>
And here is more <span class="fun">Cool</span>.
</p>
</div>
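
For comparison, the same text-node idea can be sketched with Mojo::DOM, the module the question started from. This is only a minimal sketch, not something taken from either answer: the helper name highlight and its two parameters are made up for illustration, and it assumes a Mojolicious version recent enough to provide descendant_nodes and type. Instead of replacing an element's whole content, it walks the individual text nodes, so attributes and child elements such as the <a> are left alone.

```perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::Util qw(xml_escape);

# Hypothetical helper: wrap every whole-word occurrence of $word that
# appears in a text node (never in an attribute) in <span class="fun">.
sub highlight {
    my ($html, $word) = @_;
    my $dom = Mojo::DOM->new($html);

    # Collect the text nodes first, then modify the tree
    my $texts = $dom->descendant_nodes->grep(sub { $_->type eq 'text' });

    for my $node ($texts->each) {
        my $text = $node->content;
        next unless $text =~ /\b\Q$word\E\b/;

        # The capturing group keeps the matches: odd positions in
        # @parts are the matched word, even positions the text between
        my @parts = split /(\b\Q$word\E\b)/, $text;
        my $i = 0;
        my $markup = join '', map {
            my $escaped = xml_escape($_);    # re-escape &, <, > etc.
            $i++ % 2 ? qq{<span class="fun">$escaped</span>} : $escaped;
        } @parts;

        # Replace this one text node with the parsed fragment
        $node->replace($markup);
    }
    return $dom->to_string;
}

print highlight(
    '<p>More Cool, but not <a href="/Cool.html">here</a>.</p>', 'Cool'
), "\n";
```

Because only nodes of type text are touched, the href value is never rewritten. Text inside <code> is an ordinary text node and is highlighted as well, which is what the question asks for.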