Question

I am trying to extract the text and tags from an HTML web page into a text file.

Here is the input web page content (as seen in View Source mode):

<div class="moduleBody">In addition, <b>ABC provides</b> dual finishing and detailing <u>products</u>, including a system of cleaners, dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics Business</p><p></p><p>The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its <b>product offerings</b> include personal protection products, such as <u>respiratory, hearing, eye and fall protection</u> equipment;<div class="moreLink">

The following code extracts the text on its own, but it strips <p>, </p>, <u>, </u>, <b>, </b> and other HTML tags, which I want to keep.

use WWW::Mechanize;

use threads;

my $mech = WWW::Mechanize->new;

my $Lvalue = "";

$mech->get($link);
$mech->quiet(1);

my $p = HTML::TokeParser->new(\$mech->content);

while ( my $tag1 = $p->get_tag('div') ) {

    if ( $tag1->[1]{class} and $tag1->[1]{class} eq 'moduleBody' ) {

        $Lvalue = $p->get_trimmed_text("moreLink");
        $Lvalue =~ s/$find1/|/g;
        $Lvalue =~ s/$find2/|/g;

        print $fh "$ticker^|$Lvalue\n";
    }
}

The output of the code above is:

In addition, ABC provides dual finishing and detailing products, including a system of cleaners, dressings, polishes, waxes and other products. Safety and Graphics Business The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its product offerings include personal protection products, such as respiratory, hearing, eye and fall protection equipment;

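In effect, my code is removing the HTML tags that I want to keep. I suspect I need to adjust the get_trimmed_text call so that it preserves the p, /p, b and /b (and other HTML) tags. Can anyone help with the necessary changes to the code?

To state the requirement clearly: I am looking for a Perl function that extracts everything (TEXT + ALL HTML TAGS) between "<div class="moduleBody">" and "<div class="moreLink">" on the web page (as in the sample input text above). I am open to using functions other than get_trimmed_text.

Thanks a lot.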

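Answering this question for the general audience: the reply provided by @SinanÜnür works well. Thank you, @SinanÜnür! +1 and marked as the answer. For the benefit of the general audience, note that Sinan Ünür's code works as-is as long as you keep the HTML content in the "my $html = <<HTML;" variable. If you are reading from a URL instead, the code needs a small adjustment to include the following: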

use LWP::Simple;
my $url = "http://www.example.com/profile?item=66&class=XYZ";
my $html = get($url);

Answer 1

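Updated answer, following the update to the question.

I am looking for a Perl function that extracts everything (TEXT + ALL HTML TAGS) between "<div class="moduleBody">" and "<div class="moreLink">" on the web page (as in the sample input text above).

HTML::TokeParser is a stream parser: you ask it for tokens or tags (which are a specific kind of token). So, using this module, you ask the parser to find the next div, check whether it has the right class and, if it does, start accumulating the content of all subsequent tokens until you reach the <div class="moreLink"> start tag.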

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML;
<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;<div
class="moreLink">
HTML

my $p = HTML::TokeParser::Simple->new(\$html);
my $start = { tag => 'div', class => 'moduleBody' };
my $end = { tag => 'div', class => 'moreLink' };

while ( defined(my $chunk = extract_html_between($p, $start, $end)) ) {
    print "[[[$chunk]]]\n"
}

sub extract_html_between {
    my $p = shift;
    my $start = shift;
    my $end = shift;

    my $chunk;
    while (my $tag = $p->get_tag($start->{tag})) {
        my $class = $tag->get_attr('class');
        next unless $class and $class eq $start->{class};

        $chunk = $tag->as_is; # only if you want the opening div
        CHUNK:
        while (my $token = $p->get_token) {
            if ( $token->is_start_tag($end->{tag}) ) {
                $class = $token->get_attr('class');
                last CHUNK if $class and $class eq $end->{class};
            }
            $chunk .= $token->as_is;
        }
        return $chunk; # hand back this chunk; the next call resumes from here
    }

    return; # no further matching div
}

Output:

[[[<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;]]]

Answer 2

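This is very odd code. You aren't using WWW::Mechanize for anything other than fetching the web page, so you may as well use LWP::UserAgent directly. Also, if you want to extract part of an HTML resource and print it, HTML::TokeParser isn't the right tool.

You don't seem to have even read the documentation, because $p->get_trimmed_text("moreLink") will return all text up to the first occurrence of a <moreLink> element, which isn't a valid HTML tag. What you have there is the value of the class attribute of a div element, not a tag name.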
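
To make that point concrete, here is a minimal sketch of how the end-tag argument of get_trimmed_text behaves. The sample markup and the helper name text_after_module_body are made up for illustration and are not part of the question's page or of any module; the sketch assumes HTML::TokeParser is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, loosely modelled on the question's page.
my $html = '<div class="moduleBody">In addition, <b>ABC provides</b> products.'
         . '<div class="moreLink">more</div> trailing text</div>';

# Illustrative helper: find the first <div class="moduleBody"> and return
# the trimmed text that follows it, stopping at the first tag named in
# @end_tags. The arguments are tag names, not class values.
sub text_after_module_body {
    my ($source, @end_tags) = @_;
    my $p = HTML::TokeParser->new(\$source);
    while ( my $tag = $p->get_tag('div') ) {
        next unless ( $tag->[1]{class} // '' ) eq 'moduleBody';
        return $p->get_trimmed_text(@end_tags);
    }
    return;
}

# "moreLink" is not a tag name, so nothing stops the scan early.
my $until_class = text_after_module_body($html, 'moreLink');

# "div" is a tag name, so the scan stops at the nested <div>.
my $until_div = text_after_module_body($html, 'div');

print "$until_class\n$until_div\n";
```

Either way the result is text only: the <b> markup is gone in both strings, which is why get_trimmed_text cannot do what the question asks for.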

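I would choose Mojolicious for this, because it will fetch the page, build a DOM, and stringify the element you specify, without needing any other modules.

I have written this, but I'm currently unable to test it.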

use strict;
use warnings 'all';

use Mojo::UserAgent;

use constant URL => 'http://example.com/';

my $ua = Mojo::UserAgent->new;

my $txn = $ua->get(URL);

if ( my $err = $txn->error ) {
    die "@{$err}{qw/ code message /}";
}

print $txn->res->dom->at('div.moduleBody')->to_string;
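
The DOM step can also be tried offline, without fetching anything. The following is a minimal sketch that assumes Mojolicious is installed and uses a made-up, well-formed sample string in place of the real page; removing the nested moreLink div is an assumption about what the asker wants, not something the answer above does.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical, well-formed sample markup standing in for a fetched page.
my $html = '<html><body><div class="moduleBody">In addition, <b>ABC provides</b> '
         . '<u>products</u>.<div class="moreLink"><a href="#">more</a></div>'
         . '</div></body></html>';

my $dom  = Mojo::DOM->new($html);
my $body = $dom->at('div.moduleBody') or die "no div.moduleBody found\n";

# Drop the nested "moreLink" div so that only the wanted markup remains.
if ( my $more = $body->at('div.moreLink') ) {
    $more->remove;
}

my $outer = $body->to_string;   # the element including its own <div> tags
my $inner = $body->content;     # the markup inside the element only

print "$outer\n$inner\n";
```

Note that a DOM parser may normalise malformed input such as the stray </p> in the question's sample, so the serialised result is not guaranteed to match the page source byte for byte.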
