Question

Possible duplicate:
Which CPAN module would you recommend for turning HTML into plain text?

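Question:

Is there a module that renders HTML specifically to extract its text while honoring font-style tags such as <tt>, <b>, and <i>, and line breaks (<br>), similar to Lynx?

For example: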

# cat test.html

<body> <div id="foo" class="blah"> <tt>test<br> <b>test</b><br> whatever<br> test</tt> </div> </body>

# lynx.exe --dump test.html

Note: the second line should be bold.

Answer 1

Lynx is a large program, and reproducing its HTML rendering is non-trivial.

How about this:

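my $lynx = '/path/to/lynx';
my $html = '...';   # placeholder: your HTML goes here
# Quoting the heredoc delimiter ('EOF') keeps the shell from expanding
# dollar signs and backquotes that appear inside the HTML.
my $txt = `$lynx --dump --width 9999 -stdin <<'EOF'\n$html\nEOF\n`;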
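
The backtick call above goes through the shell, so the HTML has to survive shell quoting. A minimal sketch of one alternative, assuming a Unix-like system: write the HTML to a temporary file and run the renderer with a list-form pipe open, which bypasses the shell. The helper name render_with is hypothetical, and the Lynx options in the usage comment (-dump, -width=9999) are the single-dash spellings of the ones used above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $html to a temporary .html file, run @cmd with
# that file as its last argument, and return what the command prints.
sub render_with {
    my ($html, @cmd) = @_;

    my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $html;
    close $fh or die "Cannot write $path: $!";

    # List-form pipe open: no shell is involved, so nothing in the HTML
    # or in the file name is interpreted as shell syntax.
    open(my $out, '-|', @cmd, $path) or die "Cannot run $cmd[0]: $!";
    local $/;               # slurp the whole output in one read
    my $text = <$out>;
    close $out;
    return $text;
}

# Usage (requires Lynx; the path is a placeholder):
#   my $txt = render_with($html, '/path/to/lynx', '-dump', '-width=9999');
```

Because the command is passed in as a list, the same helper works with any renderer that accepts a file name as its last argument.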

Answer 2

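Go to search.cpan.org and search for "HTML text"; it will give you many options to suit your particular needs. HTML::FormatText is a good baseline; from there you can branch out to its more specific variants, such as HTML::FormatText::WithLinks if you want to keep links as footnotes.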
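
As a starting point, here is a minimal sketch of the HTML::FormatText baseline, assuming the module (part of the HTML-Formatter distribution on CPAN) is installed. It feeds the test.html markup from the question to the format_string class method; the margin values are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use HTML::FormatText;   # from CPAN, not a core module

# The sample markup from the question.
my $html = '<body> <div id="foo" class="blah"> <tt>test<br> <b>test</b><br> '
         . 'whatever<br> test</tt> </div> </body>';

# Render the HTML string to plain text with the chosen margins.
my $text = HTML::FormatText->format_string(
    $html,
    leftmargin  => 0,
    rightmargin => 72,
);

print $text;
```

This produces plain text only, so it is worth checking whether the output preserves the bold line before settling on this module.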

Answer 3

I am on Windows, so I cannot fully test this, but you could adapt the htext example that ships with HTML::Parser:

#!/usr/bin/perl

use strict; use warnings;

use HTML::Parser 3.00 ();
use Term::ANSIColor;

# Nesting depth of each currently open tag, keyed by tag name.
my %inside;

# Called for every start tag (+1) and end tag (-1).
sub tag {
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # not for all tags
}

# Called for every run of decoded text.
sub text {
    return if $inside{script} || $inside{style};
    my $esc = 1;
    if ( $inside{b} or $inside{strong} ) {
        print color 'blue';      # bold text is shown in blue
    }
    elsif ( $inside{i} or $inside{em} ) {
        print color 'yellow';    # italic text is shown in yellow
    }
    else {
        $esc = 0;
    }
    print $_[0];
    print color 'reset' if $esc;
}

HTML::Parser->new(api_version => 3,
    handlers => [
        start => [\&tag, "tagname, '+1'"],
        end   => [\&tag, "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";

How can I render HTML as text with Perl, the way Lynx does?
