I am using the following code to strip HTML elements from the txt files in a directory:
use strict;
use warnings;
use File::Spec;
use HTML::FormatText;
use Cwd;

my $direct = "/directory/";
opendir my $dh, $direct or die "Can't open directory";

while ( readdir $dh ) {
    next if /^\./;
    my $file = File::Spec->catfile($direct, $_);
    print $file."\n";
    my $outfile = File::Spec->catfile($direct, "out_$_");
    next unless -f $file;

    my $html = do {
        open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
        local $/;
        <$fh>;
    };
    next unless $html =~ /<html/i;

    my $formatted = HTML::FormatText->format_string(
        $html, leftmargin => 0, rightmargin => 60);

    open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);
    print $fh "File: $file\n\n";
    print $fh "$formatted\n";
    print $fh "*" x 40, "\n" ;
    close $fh or die qq(Unable to close "$outfile" after writing: $!);
    unlink $file or warn "Could not unlink $file: $!";
}
But it seems to leave a lot of unwanted markup in the resulting output:
<div style="text-align:center;"><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">TEXT TEXT TEXT TEXT</font></div><div style="text-align:center;"><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">TEXT TEXT TEXT TEXT</font></div><div style="text-align:center;">&#160;</div><p style='margin-top:0pt; margin-bottom:0pt'><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;">1</font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;">. </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">ORGANIZATION </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">AND</font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"> SUMMARY OF </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;">SIGNIFICANT ACCOUNTING </font><font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-
Any idea how to get rid of this HTML/CSS while keeping the text inside those tags?
Answer 0 (score: 3)
The HTML::Parser distribution includes an example program that extracts the plain text from an HTML file.
#!/usr/bin/perl -w

# Extract all plain text from an HTML file

use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
    my($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
                  handlers    => [start => [\&tag, "tagname, '+1'"],
                                  end   => [\&tag, "tagname, '-1'"],
                                  text  => [\&text, "dtext"],
                                 ],
                  marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
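The example above prints as it parses. To drop it into a loop like the one in the question, it may be handier to collect the text into a string. Below is a minimal sketch under some assumptions: `html_to_text` is a hypothetical helper name (not part of HTML::Parser), and the list of tags that get a separating space is an arbitrary choice that picks up the "not for all tags" comment in the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper (not part of HTML::Parser): return the text content of
# an HTML string, skipping anything inside <script> or <style>.
sub html_to_text {
    my ($html) = @_;

    # Assumption: only these tags should separate words; inline tags such
    # as <font> or <b> are dropped without adding a space.
    my %block = map { $_ => 1 } qw(div p br tr td th li table h1 h2 h3 h4 h5 h6);

    my %inside;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            # Track nesting depth per tag name, as the original example does.
            start => [
                sub { my ($tag) = @_; $inside{$tag}++; $out .= ' ' if $block{$tag} },
                'tagname',
            ],
            end => [
                sub { my ($tag) = @_; $inside{$tag}--; $out .= ' ' if $block{$tag} },
                'tagname',
            ],
            # "dtext" passes text with entities such as &#160; already decoded.
            text => [
                sub {
                    return if $inside{script} || $inside{style};
                    $out .= $_[0];
                },
                'dtext',
            ],
        ],
        marked_sections => 1,
    );

    $parser->parse($html);
    $parser->eof;

    # Collapse whitespace runs, including the no-break space that &#160; decodes to.
    $out =~ s/[\s\x{A0}]+/ /g;
    $out =~ s/^ | $//g;
    return $out;
}

# Demo on a fragment shaped like the sample in the question.
print html_to_text(
    '<div style="text-align:center;"><font style="font-weight:bold;">TEXT TEXT</font></div>'
  . '<p><font>1</font><font>. </font><font>ORGANIZATION </font><font>AND</font></p>'
), "\n";
```

In the question's loop, the `HTML::FormatText->format_string(...)` call could then be swapped for `html_to_text($html)`. Whether the `/<html/i` check should stay depends on whether the files are full documents or fragments.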
Answer 1 (score: 0)
If you have Mojolicious installed, then something like:
perl -MMojo::DOM -0 -e 'print my $dom = Mojo::DOM->new(<>)->all_text()' file.html
might work :-)
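Explanation: Mojo::DOM->new(<>)->all_text() should be self-explanatory ;-) ... <> reads the input (here file.html, or STDIN if no file is named) and hands it to the Mojo::DOM constructor, and ->all_text() calls the all_text method on the resulting object.
For the -0 switch, see perlrun. With no digits it sets the input record separator to the null character, so for an ordinary text file <> returns the whole thing in one read. By the convention described in perlrun, -0777 is the value normally used to slurp a whole file. You can also use Mojo::DOM to write a real script, as in Dave's answer, rather than the hackish one-liner in my example.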
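As a rough idea of what such a script could look like, here is a minimal sketch. `dom_text` is a hypothetical helper name. The only Mojolicious calls used are `Mojo::DOM->new` and `all_text`, the same two as in the one-liner.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: parse an HTML string and return the text of all
# descendant nodes, with the markup removed.
sub dom_text {
    my ($html) = @_;
    return Mojo::DOM->new($html)->all_text;
}

# Demo on a fragment shaped like the sample in the question.
print dom_text(
    '<div style="text-align:center;"><font style="font-weight:bold;">TEXT TEXT</font></div>'
), "\n";
```

To process a file, slurp it the way the question's code already does (`do { local $/; <$fh> }`) and pass the result to `dom_text`. How whitespace between elements comes out may differ between Mojolicious versions, so it is worth checking the output on a real file.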