I have some HTML content that I am reading in Perl, and I only want to grab the words inside the tags, i.e.:
<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>
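I want to grab every word inside the span tags. I tried:
([\w|’|-]+)([\W])
but it also matches the tag names as words; see here: https://regex101.com/r/mD3qG4/3
Please suggest a regular expression that achieves this.
Thanks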
Answer 0 (score: 3)
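Never process HTML with regular expressions unless you are absolutely forced to, and perhaps not even then. There are several perfectly serviceable HTML parsers on CPAN, and HTML::TreeBuilder is well suited to this.
Here is a program that processes the data as you asked. It finds all span elements that have an id attribute matching the regex pattern f\d{3}, and stores their text content in the array @text.
I had to put use utf8 at the top because the text in the __DATA__ section contains some non-ASCII characters. You don't need it if you are reading that content from an external file.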
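use utf8;                               # the __DATA__ section contains non-ASCII characters
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf8) /;    # UTF-8 for STDIN/STDOUT/STDERR and default I/O layers
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file(\*DATA);

# Collect the text of every <span> whose id looks like f002, f003, ...
my @text = map { $_->as_text } $tree->look_down( _tag => 'span', id => qr/^f\d{3}$/ );

print "$_\n" for @text;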
__DATA__
<span id="f002">From fairest creatures we desire increase,</span><br/>
<span id="f003">That thereby beauty’s rose might never die,</span><br/>
<span id="f004">But as the riper should by time decease,</span><br/>
<span id="f005">His tender heir might bear his memory:</span><br/>
<span id="f006">But thou contracted to thine own bright eyes,</span><br/>
<span id="f007">Feed’st thy light’s flame with self-substantial fuel,</span><br/>
<span id="f008">Making a famine where abundance lies,</span><br/>
<span id="f009">Thy self thy foe, to thy sweet self too cruel:</span><br/>
<span id="f010">Thou that art now the world’s fresh ornament,</span><br/>
<span id="f011">And only herald to the gaudy spring,</span><br/>
<span id="f012">Within thine own bud buriest thy content,</span><br/>
<span id="f013">And tender churl mak’st waste in niggarding:</span><br/>
<span id="f014">Pity the world, or else this glutton be,</span><br/>
<span id="f015">To eat the world’s due, by the grave and thee.</span>
Output:
From fairest creatures we desire increase,
That thereby beauty’s rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed’st thy light’s flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world’s fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak’st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world’s due, by the grave and thee.
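The question asks for each word rather than each line, so once the text has been extracted, a plain regex over the text (not over the HTML) is a reasonable way to split it. The following is only a minimal sketch using core Perl: the two-element @text is a hypothetical stand-in for the array built by the program, and the definition of a word (word characters optionally joined by ’, ' or -) is an assumption based on the character class in the question. Note that inside a character class such as [\w|’|-] the | is a literal character, not alternation, so it is left out here.

```perl
use utf8;                               # the sample strings contain ’ (U+2019)
use strict;
use warnings;
use open qw/ :std :encoding(UTF-8) /;

# Hypothetical stand-in for the @text array produced by the parser
my @text = (
    'From fairest creatures we desire increase,',
    'Feed’st thy light’s flame with self-substantial fuel,',
);

# Assumed definition of a word: runs of word characters, optionally
# joined by a typographic apostrophe, an ASCII apostrophe, or a hyphen.
# In list context a /g match with no capture groups returns every match.
my @words = map { /\w+(?:[’'-]\w+)*/g } @text;

print "$_\n" for @words;
```

With this pattern, Feed’st, light’s and self-substantial each come out as a single word, and trailing punctuation such as the comma is dropped. If a different notion of a word is wanted, only the pattern needs to change.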