Question

I am using the following code to strip HTML elements from the txt files in a directory:

use strict;
use warnings;

use File::Spec;
use HTML::FormatText;
use Cwd;

my $direct = "/directory/";

opendir my $dh, $direct or die "Can't open directory";

while ( readdir $dh ) {

  next if /^\./;

  my $file = File::Spec->catfile($direct, $_);
  print $file."\n";
  my $outfile = File::Spec->catfile($direct, "out_$_");
  next unless -f $file;

  my $html = do {
    open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
    local $/;
    <$fh>;
  };

  next unless $html =~ /<html/i;

  my $formatted = HTML::FormatText->format_string(
      $html, leftmargin => 0, rightmargin => 60);

  open my $fh, '>', $outfile or die qq(Unable to open "$outfile" for writing: $!);

  print $fh "File: $file\n\n";
  print $fh "$formatted\n";
  print $fh "*" x 40, "\n" ;

  close $fh or die qq(Unable to close "$outfile" after writing: $!);
  unlink $file or warn "Could not unlink $file: $!";
}

But it seems to leave a lot of unwanted characters in the resulting output:

&lt;div style="text-align:center;"&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;"&gt;TEXT TEXT TEXT TEXT&lt;/font&gt;&lt;/div&gt;&lt;div style="text-align:center;"&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;"&gt;TEXT TEXT TEXT TEXT&lt;/font&gt;&lt;/div&gt;&lt;div style="text-align:center;"&gt;&amp;#160;&lt;/div&gt;&lt;p style='margin-top:0pt; margin-bottom:0pt'&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;margin-left:0px;"&gt;1&lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;"&gt;.  &lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"&gt;ORGANIZATION &lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"&gt;AND&lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"&gt; SUMMARY OF &lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-decoration:underline;"&gt;SIGNIFICANT ACCOUNTING &lt;/font&gt;&lt;font style="font-family:Times New Roman;font-size:11pt;font-weight:bold;text-

Any idea how to get rid of this HTML/CSS (while keeping the text inside the tags)?

Answer 1

The HTML::Parser distribution includes an example program for extracting plain text from an HTML file.

#!/usr/bin/perl -w

# Extract all plain text from an HTML file

use strict;
use HTML::Parser 3.00 ();

my %inside;

sub tag
{
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # not for all tags
}

sub text
{
    return if $inside{script} || $inside{style};
    print $_[0];
}

HTML::Parser->new(api_version => 3,
          handlers    => [start => [\&tag, "tagname, '+1'"],
                  end   => [\&tag, "tagname, '-1'"],
                  text  => [\&text, "dtext"],
                 ],
          marked_sections => 1,
    )->parse_file(shift) || die "Can't open file: $!\n";
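
The example above prints straight to STDOUT. To use the same idea inside a loop like the one in the question, the text can be collected into a string instead. The following is only a sketch under that assumption: html_to_text is a hypothetical helper name, not part of HTML::Parser, and the whitespace clean-up at the end is an addition of this sketch.

```perl
use strict;
use warnings;
use HTML::Parser 3.00 ();

# Hypothetical helper (not part of HTML::Parser): returns the plain text
# found in an HTML string, skipping the contents of <script> and <style>.
sub html_to_text {
    my ($html) = @_;
    my %inside;       # current nesting depth per tag name
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        handlers    => [
            # Every tag contributes a space so adjacent words do not fuse.
            start => [ sub { $inside{ $_[0] }++; $text .= ' ' }, 'tagname' ],
            end   => [ sub { $inside{ $_[0] }-- if $inside{ $_[0] }; $text .= ' ' }, 'tagname' ],
            # 'dtext' delivers the text with entities already decoded.
            text  => [ sub { $text .= $_[0] unless $inside{script} || $inside{style} }, 'dtext' ],
        ],
    );
    $parser->parse($html);
    $parser->eof;

    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

print html_to_text(
    '<div style="text-align:center;"><font style="font-weight:bold;">TEXT TEXT</font></div>'
), "\n";
```

In the question's loop, a call such as html_to_text($html) could stand in for the HTML::FormatText call. One caveat, stated as a guess about the data: if the tags in the input are themselves entity-escaped (the sample shows &lt;div ...&gt;, though that may only be how the page displays it), a single pass decodes them into literal text rather than removing them, so a second pass over the result would be needed.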

Answer 2

If you have Mojolicious installed, then something like:

perl -MMojo::DOM -0 -e 'print my $dom = Mojo::DOM->new(<>)->all_text()' file.html

might do the trick :-)

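Explanation: Mojo::DOM->new(<>)->all_text() should be self-explanatory ;-) ... <> just feeds the DOM object with whatever arrives on STDIN (or, as here, the contents of the file named on the command line), and ->all_text() calls the all_text method on that object.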

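For the -0 switch, see perlrun. Basically it is for slurping files, so that <> contains the whole thing (err ... someone will correct me in the comments): it sets the input record separator to the null character, so a file without null bytes is read as a single record. You could write a real script with Mojo::DOM, like Dave's answer, rather than just the hackish one-liner in my example.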
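
As a rough idea of what such a "real script" might look like, here is a minimal sketch. The helper name strip_html and the UTF-8 input encoding are assumptions made for illustration; only Mojo::DOM->new and all_text come from the one-liner above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: plain text of an HTML string via Mojo::DOM.
sub strip_html {
    my ($html) = @_;
    return Mojo::DOM->new($html)->all_text;
}

binmode STDOUT, ':encoding(UTF-8)';

# Process every file named on the command line.
for my $file (@ARGV) {
    # Assumption: the input files are UTF-8 encoded.
    open my $fh, '<:encoding(UTF-8)', $file
        or die qq(Unable to open "$file" for reading: $!);
    my $html = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    print strip_html($html), "\n";
}
```

Saved under a name of your choice, it would be run with the HTML files as arguments. How all_text treats whitespace differs between Mojolicious versions, so the output may need extra trimming.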

Perl - Removing CSS (and other unwanted characters) from text files
