Match only the words inside HTML tags using Perl

Asked: 2016-07-15 12:55:39

Tags: html perl

I have some HTML content that I am reading in Perl, and I want to grab only the words inside the tags, i.e.:

<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>

I want to grab every word inside the span tags.

I tried:

([\w|’|-]+)([\W])

But it also matches the tag names as words; see here: https://regex101.com/r/mD3qG4/3 Please suggest a regular expression that achieves this.

Thanks

1 answer:

Answer 0: (score: 3)

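Never use regular expressions to process HTML unless you are absolutely forced to, and perhaps not even then. There are several perfectly serviceable HTML parsers on CPAN, and HTML::TreeBuilder is more than adequate for this.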

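Here is a program that processes your data as you ask. It finds all span elements that have an id attribute matching the regex pattern f\d{3}, and stores their text content in the array @text.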

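I had to put use utf8 at the top because the text in the __DATA__ section contains some non-ASCII characters. If you are reading the content from an external file, it is not necessary.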

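use utf8;    # the source, including __DATA__, contains non-ASCII characters
use strict;
use warnings 'all';

# Use UTF-8 as the default I/O layer, including for the standard handles
use open qw/ :std :encoding(utf8) /;

use HTML::TreeBuilder;

# Parse the HTML held in the __DATA__ section into an element tree
my $tree = HTML::TreeBuilder->new;
$tree->parse_file(\*DATA);

# Collect the text of every <span> whose id is "f" followed by three digits
my @text = map { $_->as_text } $tree->look_down( _tag => 'span', id => qr/^f\d{3}$/ );

print "$_\n" for @text;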

__DATA__
<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>

Output

From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed’st thy light’s flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world’s fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak’st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world’s due, by the grave and thee.
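
The question asks for individual words, whereas @text holds one string per span. Below is a minimal sketch of one way to split those strings into words once the parser has removed the markup. It assumes that the character class from the question's own attempt (\w, ’ and -) is the intended definition of a word. The helper name split_words and the two-line sample array are illustrative stand-ins, not part of the original program:

```perl
use utf8;    # the sample strings below contain the non-ASCII character ’
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Two sample lines standing in for the contents of @text (assumption:
# in real use these strings would come from the as_text calls above)
my @text = (
    'That thereby beauty’s rose might never die,',
    'Feed’st thy light’s flame with self-substantial fuel,',
);

# Hypothetical helper: return every run of word characters, typographic
# apostrophes and hyphens found in one line of plain text
sub split_words {
    my ($line) = @_;
    return $line =~ /[\w’-]+/g;
}

my @words = map { split_words($_) } @text;

print "$_\n" for @words;
```

Because the tags are already gone by the time the pattern runs, it cannot pick up span or id as words, which was the problem with applying the regex to the raw HTML.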