In an HTML file

Time: 2017-04-30 18:36:53

Tags: perl

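I am trying to extract both the text and the tags from an HTML web page into a text file.

Here is the input web page content (as seen in "view source" mode):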

<div class="moduleBody">In addition, <b>ABC provides</b> dual finishing and detailing <u>products</u>, including a system of cleaners, dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics Business</p><p></p><p>The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its <b>product offerings</b> include personal protection products, such as <u>respiratory, hearing, eye and fall protection</u> equipment;<div class="moreLink">

The code below extracts the text on its own, but it strips <p></p>, <u></u>, <b></b> and the other HTML tags, which I want to keep.

use WWW::Mechanize;
use HTML::TokeParser;    # required for HTML::TokeParser->new below

use threads;

my $mech = WWW::Mechanize->new;

my $Lvalue = "";

$mech->get($link);
$mech->quiet(1);

my $p = HTML::TokeParser->new(\$mech->content);

while ( my $tag1 = $p->get_tag('div') ) {

    if ( $tag1->[1]{class} and $tag1->[1]{class} eq 'moduleBody' ) {

        $Lvalue = $p->get_trimmed_text("moreLink");
        $Lvalue =~ s/$find1/|/g;
        $Lvalue =~ s/$find2/|/g;

        print $fh "$ticker^|$Lvalue\n";
    }
}

The output of the code above is:

In addition, ABC provides dual finishing and detailing products, including a system of cleaners, dressings, polishes, waxes and other products. Safety and Graphics Business The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its product offerings include personal protection products, such as respiratory, hearing, eye and fall protection equipment;

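In effect, my code removes the HTML tags that I want to keep. I think get_trimmed_text may need adjusting so that it keeps the p, /p, b and /b (and other HTML) tags. Can anyone help with the changes the code needs?

To state the requirement explicitly: I am looking for a Perl function that extracts (TEXT + ALL HTML TAGS) between "<div class="moduleBody">" and "<div class="moreLink">" on the web page (as in the sample input above). I am open to using functions other than get_trimmed_text.

Many thanks.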

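Answer to this question, for general readers: the reply from @SinanÜnür works well. Thank you, @SinanÜnür! +1 and marked as the answer. For the benefit of general readers, note that Sinan Ünür's code runs as-is as long as you keep the HTML content in the "my $html = <<HTML;" variable. If you are reading from a URL instead, the code needs a small adjustment to include the following: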

use LWP::Simple;
my $url = "http://www.example.com/profile?item=66&class=XYZ";
my $html = get($url);

2 answers:

Answer 0: (score: 1)

Answer updated following the update to the question.

  

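"I am looking for a Perl function that extracts (TEXT + ALL HTML TAGS) between "<div class="moduleBody">" and "<div class="moreLink">" on the web page (as in the sample input above)."

HTML::TokeParser is a stream parser: you ask it for tokens or for tags (which are a specific kind of token). So, with this module, you ask the parser to find the next div, check whether it has the right class and, if it does, start accumulating the content of every subsequent token until you reach the <div class="moreLink"> start tag.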

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<HTML;
<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;<div
class="moreLink">
HTML

my $p = HTML::TokeParser::Simple->new(\$html);
my $start = { tag => 'div', class => 'moduleBody' };
my $end = { tag => 'div', class => 'moreLink' };

while ( defined(my $chunk = extract_html_between($p, $start, $end)) ) {
    print "[[[$chunk]]]\n"
}

sub extract_html_between {
    my $p = shift;
    my $start = shift;
    my $end = shift;

    my $chunk;
    while (my $tag = $p->get_tag($start->{tag})) {
        my $class = $tag->get_attr('class');
        next unless $class and $class eq $start->{class};

        $chunk = $tag->as_is; # only if you want the opening div
        CHUNK:
        while (my $token = $p->get_token) {
            if ( $token->is_start_tag($end->{tag}) ) {
                $class = $token->get_attr('class');
                last CHUNK if $class and $class eq $end->{class};
            }
            $chunk .= $token->as_is;
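        }
        return $chunk; # stop here so a later match cannot overwrite this chunk
    }

    return; # no further start tag: undef ends the caller's loop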
}

Output:

[[[<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;]]]
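To make the "stream of tokens" idea concrete, here is a minimal sketch that uses plain HTML::TokeParser (not the ::Simple wrapper) to list each token's type next to its original markup. The shortened sample string and the helper name token_stream are assumptions made for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened, hypothetical sample in the shape of the question's markup.
my $html = '<div class="moduleBody">Its <b>product offerings</b> include<div class="moreLink">';

# Return one [type, original text] pair per token, in document order.
sub token_stream {
    my ($doc) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot create parser";
    my @tokens;
    while (my $t = $p->get_token) {
        my $type = $t->[0];
        # The original markup sits at a different index for each token type:
        # start tag ["S", tag, attr, attrseq, text], end tag ["E", tag, text],
        # text ["T", text, is_data].
        my $text = $type eq 'S' ? $t->[4]
                 : $type eq 'E' ? $t->[2]
                 :                $t->[1];
        push @tokens, [ $type, $text ];
    }
    return @tokens;
}

my @tokens = token_stream($html);
printf "%s  %s\n", @$_ for @tokens;
```

Concatenating the second field of every pair gives back the input string, which is why appending each token's as_is text, as the answer's loop does, preserves the tags.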

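Answer 1: (score: 1)

This is very odd code. You aren't using WWW::Mechanize for anything beyond fetching the web page, so you may as well use LWP::UserAgent directly. Also, if you want to extract part of an HTML resource and print it, HTML::TokeParser isn't the right tool.

You don't appear to have even read the documentation, because $p->get_trimmed_text("moreLink") will return all of the text up to the first occurrence of a <moreLink> element, which isn't a valid HTML tag. The string you passed, moreLink, is the value of a div element's class attribute, not a tag name.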
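The point about the argument being a tag name can be checked with a small sketch. The sample string and the helper name text_until are hypothetical; the only library calls used are get_tag and get_trimmed_text from HTML::TokeParser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample in the shape of the question's markup.
my $html = '<div class="moduleBody">one <b>two</b> three '
         . '<div class="moreLink">four</div></div>';

# Skip to the first <div>, then collect text up to the start tag named $stop.
sub text_until {
    my ($doc, $stop) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "cannot create parser";
    $p->get_tag('div') or return;
    return $p->get_trimmed_text($stop);
}

# No <moreLink> tag exists, so the text runs to the end of the document.
print "stop at 'moreLink': ", text_until($html, 'moreLink'), "\n";

# 'div' is a real tag name, so the text stops before the nested <div>.
print "stop at 'div':      ", text_until($html, 'div'), "\n";
```

In both calls the <b> tags are absent from the result: get_trimmed_text only ever returns text, so no choice of argument will make it keep markup.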

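I would choose Mojolicious for this, because it will fetch the page, build a DOM, and stringify the element you specify, without needing any other modules.

I have written this, but I can't test it at the moment.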

use strict;
use warnings 'all';

use Mojo::UserAgent;

use constant URL => 'http://example.com/';

my $ua = Mojo::UserAgent->new;

my $txn = $ua->get(URL);

if ( my $err = $txn->error ) {
    die "@{$err}{qw/ code message /}";
}

print $txn->res->dom->at('div.moduleBody')->to_string;
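For trying the DOM approach without a network request, the following sketch runs Mojo::DOM on an in-memory string. The sample markup and the helper name module_body_markup are assumptions. Stringifying div.moduleBody would also include the nested moreLink div, so the sketch removes that element first and returns the inner markup. A DOM parser may also normalize malformed input, such as the stray </p> in the question's sample, so the result is not guaranteed to match the source byte for byte.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical, well-formed sample in the shape of the question's markup.
my $html = '<div class="moduleBody">In addition, <b>ABC provides</b> '
         . '<u>products</u>;<div class="moreLink">more</div></div>';

# Return the inner markup of div.moduleBody without any nested div.moreLink.
sub module_body_markup {
    my ($doc) = @_;
    my $div = Mojo::DOM->new($doc)->at('div.moduleBody') or return;
    $div->find('div.moreLink')->each(sub { $_->remove });
    return $div->content;
}

print module_body_markup($html), "\n";
```

With a live page, the same helper could be given the response body as a string instead of the sample.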