Question

I have a bunch of HTML files that I need to extract text from, but not the contents of the lists. The HTML looks like this:

<html>

    <Head>
        <title>intranet mycompany</title>
    </head>

    <body>
        <div>blah</div>
        <p>the text i need to extract
            <br>
            <ul>
                <li>stuff i don't want.</li>
                <li>more stuff i don't want.</li>
            </ul>More text i need to exctract.</p>
    </body>

</html>

I'd really like some advice on how to get the text from the paragraph but not the text in the list. Any suggestions would be appreciated.

Regards, Jombo.

Answer 1

use strictures;
use HTML::TreeBuilder::XPath qw();
my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<html> … </body>));
my ($ul) = $dom->findnodes('//ul');
$ul->delete;
my $extract = $dom->findvalue('//p');
# " the text i need to extract  More text i need to exctract. "

Answer 2

Take a look at the HTML parsers on CPAN; you will find good parsers there, such as HTML::TreeBuilder, HTML::Parser, etc.
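As a minimal sketch of what that could look like with HTML::TreeBuilder alone (no XPath), the following removes every <ul> element and then reads the text of each <p>. The inline sample document and the variable names are assumptions made for illustration, and the sketch relies on TreeBuilder's default settings, under which a <ul> written inside a <p> stays a child of that paragraph.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample input modelled on the question (assumed for illustration).
my $html = <<'HTML';
<html><body>
<div>blah</div>
<p>the text i need to extract
<br>
<ul>
<li>stuff i don't want.</li>
<li>more stuff i don't want.</li>
</ul>More text i need to extract.</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Remove every <ul> element, together with its children, from the tree.
$_->delete for $tree->look_down(_tag => 'ul');

# Collect the remaining text of each <p> element.
my @paragraphs = map { $_->as_text } $tree->look_down(_tag => 'p');
print "$_\n" for @paragraphs;

# Free the tree explicitly (needed on older HTML::Tree releases).
$tree->delete;
```

For real files, new_from_file could replace new_from_content; the rest of the sketch would stay the same.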

Answer 3

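Here is one way to get rid of the <ul> data. Since HTML::Parser does not know where it is in the document when it calls the text handler, you have to find some way to give it that information.

Have the start_handler, which is called for every start element, make a note of an opening <ul>, and have the end_handler remove that note. You can then use that information in the text_handler to skip the text nodes inside <ul>s.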

#!/usr/bin/perl -w
use strict;
use HTML::Parser;

my $text = '';
my $parser = HTML::Parser->new(
  start_h => [ \&start_handler, "self,tagname" ],
  end_h   => [ \&end_handler,   "self,tagname" ],
  text_h  => [ \&text_handler,  "self,dtext" ],
);

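sub start_handler {
  my ($self, $tag) = @_;
  # make a note; counting the depth keeps nested <ul>s working
  $self->{_private}->{'ul'}++ if ( $tag eq 'ul' );
}

sub end_handler {
  my ($self, $tag) = @_;
  # remove the note; never drop below zero on a stray </ul>
  $self->{_private}->{'ul'}-- if ( $tag eq 'ul' && $self->{_private}->{'ul'} );
}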

sub text_handler {
  my ($self, $dtext) = @_;
  unless ($self->{_private}->{'ul'}) {
    # only if we're not inside the <ul>
    $text .= $dtext;
  }
}
# parse_file returns undef if the file cannot be opened
$parser->parse_file('test.html') or die "Cannot parse test.html: $!";
print $text;

Answer 4

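The hardest part is that the data spans multiple lines. If you join all the lines into one big string, then a simple regex like

s/<ul>.*?<\/ul>//gs

should do it. The non-greedy .*? stops the match from running from the first <ul> to the last </ul> when there are several lists, and the /s modifier lets . match any newlines left in the string.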
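A rough sketch of the join-and-substitute idea, using only core Perl. The inline sample string and the variable names are assumptions for illustration; with a real file you would read the whole content into one string first (for example by setting local $/ to undef before reading). The pattern here is slightly broader than a bare <ul>: it also accepts attributes on the tag and ignores case. It is not a real HTML parser, and nested lists will trip it up because the match stops at the first </ul>.

```perl
use strict;
use warnings;

# Sample input held in one multi-line string (assumed for illustration).
my $html = <<'HTML';
<p>the text i need to extract
<br>
<ul>
<li>stuff i don't want.</li>
<li>more stuff i don't want.</li>
</ul>More text i need to extract.</p>
<ul class="nav"><li>another list</li></ul>
HTML

# Copy the string, then strip each <ul>...</ul> block from the copy.
#   \b[^>]*  allows attributes on the opening tag
#   .*?      is non-greedy, so each list is removed separately
#   /s       lets . match newlines, /i ignores case, /g removes all lists
(my $stripped = $html) =~ s{<ul\b[^>]*>.*?</ul>}{}gis;

print $stripped;
```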

Perl: remove a <ul> list from a paragraph. HTML parsing
