Task
Replace all whitespace in the content of any tag with &nbsp;.
y.html (sample file)
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
What I tried
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
open (my $fh, "<", "y.html") or die $!;
my $dom = Mojo::DOM->new(do{local $/ = undef; <$fh>});
$dom->find("*")->each( sub { $_->content( $_->content =~ s/\s/\&nbsp;/gr ) } );
print $dom;
Result of the above script
<p class="MsoNormal" style="margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt"><span lang="nl-be" style="font-size:10.0pt; font-family:symbol;color:black;mso-ansi-language:nl-be">·<span class="grame"><span style="font-s
ize:7.0pt;color:black"> <span style="font-size:10.0pt;font-family:arial;color:black">Kit<span style="font-size:10.0pt;font-family:arial;color:black"> <span class="spelle"><i><span&nb
sp;style="font-size:10.0pt;font-family:arial">Strongyloides<i><span style="font-size:10.0pt;font-family:arial"> <span class="spelle">ratti<span style="font-size:10.0pt;font-family:arial"> (n
r. 9450) van <span class="spelle">Bordier Affinity Products. <span lang="nl-be" style="font-size:10.0pt;font-family: arial;mso-ansi-language:nl-be">Zie bijsluiter in bijlage: CLKB_B_030
6. Te bewaren bij 2 – 8 °C tot vervaldatum.<span lang="nl-be" style="mso-ansi-language: nl-be"><o:p></o:p></span lang="nl-be" style="mso-ansi-language: nl-be"></span lang
="nl-be" style="font-size:10.0pt;font-family: arial;mso-ansi-language:nl-be"></span class="spelle"></span style="font-size:10.0pt;font-family:arial"></span class="spelle"></span&nb
sp;style="font-size:10.0pt;font-family:arial"></i></span style="font-size:10.0pt;font-family:arial"></i></span class="spelle"></span style="font-size:10.0pt;font-family:arial;color:black"></
span style="font-size:10.0pt;font-family:arial;color:black"></span style="font-size:7.0pt;color:black"></span class="grame"></span lang="nl-be" style="font-size:10.0pt; font-f
amily:symbol;color:black;mso-ansi-language:nl-be"></p>
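I am not getting the desired output: it also adds &nbsp; inside the tags (for example: </span&nbsp;), whereas I want the replacement applied only to the content.
PS: I tried this with Mojo::DOM, but using it is not a requirement; any other parser is fine. Still, I would like to know what is wrong with my code.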
Answer 0 (score: 4)
This is a job for tokenizing the input, which makes it easier to work with. I would therefore suggest using HTML::TokeParser:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use HTML::TokeParser;
my $data = do {local $/; <DATA>};
my $p = HTML::TokeParser->new(\$data);
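while (my $token = $p->get_token) {
    if ($token->[0] eq 'T') {
        # Text token: replace spaces in the text only.
        my $text = $token->[1];
        $text =~ s/ /&nbsp;/g;
        print $text;
    } else {
        # Any other token (tag, comment, ...): the last element is its original source text.
        print $token->[-1];
    }
}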
__DATA__
<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'> </span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'> <span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'> (nr. 9450) van <span class=SpellE>Bordier</span>
Affinity Products. </span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie bijsluiter in bijlage: CLKB_B_0306. Te
bewaren bij 2 – 8 °C tot vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>
Output:
<html>
<body>
<p class=MsoNormal style='margin-top:1.0pt;margin-right:0cm;margin-bottom:1.0pt;
margin-left:34.0pt;text-indent:-19.8pt'><span lang=NL-BE style='font-size:10.0pt;
font-family:Symbol;color:black;mso-ansi-language:NL-BE'>·</span><span
class=GramE><span style='font-size:7.0pt;color:black'>
</span><span style='font-size:10.0pt;font-family:Arial;color:black'>Kit</span></span><span
style='font-size:10.0pt;font-family:Arial;color:black'>&nbsp;</span><span
class=SpellE><i><span style='font-size:10.0pt;font-family:Arial'>Strongyloides</span></i></span><i><span
style='font-size:10.0pt;font-family:Arial'>&nbsp;<span class=SpellE>ratti</span></span></i><span
style='font-size:10.0pt;font-family:Arial'>&nbsp;(nr.&nbsp;9450)&nbsp;van&nbsp;<span class=SpellE>Bordier</span>
Affinity&nbsp;Products.&nbsp;</span><span lang=NL-BE style='font-size:10.0pt;font-family:
Arial;mso-ansi-language:NL-BE'>Zie&nbsp;bijsluiter&nbsp;in&nbsp;bijlage:&nbsp;CLKB_B_0306.&nbsp;Te
bewaren&nbsp;bij&nbsp;2&nbsp;–&nbsp;8&nbsp;°C&nbsp;tot&nbsp;vervaldatum.</span><span lang=NL-BE style='mso-ansi-language:
NL-BE'><o:p></o:p></span></p>
</body>
</html>
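As for what is wrong with the Mojo::DOM attempt in the question: as far as I can tell, content on an element returns (and sets) that element's inner markup, child tags included, so the substitution also hits the whitespace between a tag name and its attributes. A minimal sketch of one way around that, assuming a reasonably recent Mojolicious, is to visit text nodes only. The sample string and the $out variable are made up for illustration. Because Mojo::DOM escapes & when it renders text, the sketch inserts the U+00A0 character first and converts it to the entity on the rendered string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input; in practice this would be the contents of y.html.
my $html = q{<p class="a b"><span style="x: y">Kit van</span> <i>Bordier</i></p>};
my $dom  = Mojo::DOM->new($html);

# Visit text nodes only, so tag names and attributes are never touched.
# Whitespace becomes U+00A0, because a literal "&nbsp;" set as text content
# would be escaped to "&amp;nbsp;" when the DOM is rendered.
$dom->descendant_nodes
    ->grep(sub { $_->type eq 'text' })
    ->each(sub { $_->content($_->content =~ s/\s/\x{a0}/gr) });

# Render the DOM, then turn every U+00A0 character into the entity.
(my $out = "$dom") =~ s/\x{a0}/&nbsp;/g;
print "$out\n";
```

Note that the final substitution runs over the whole rendered string, so any non-breaking space that was already present in an attribute value would be converted as well. Mojo::DOM also re-serializes the markup (attribute quoting, for example), which the HTML::TokeParser approach avoids by printing each tag's original text.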