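I am trying to extract both the text and the tags from an HTML web page into a text file.
Here is the input page content (as shown in view-source mode):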
<div class="moduleBody">In addition, <b>ABC provides</b> dual finishing and detailing <u>products</u>, including a system of cleaners, dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics Business</p><p></p><p>The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its <b>product offerings</b> include personal protection products, such as <u>respiratory, hearing, eye and fall protection</u> equipment;<div class="moreLink">
The following code extracts the text on its own, but it strips <p>, </p>, <u>, </u>, <b>, </b> and other HTML tags, which I want to keep.
use WWW::Mechanize;
use HTML::TokeParser;
use threads;

# $link, $fh, $ticker, $find1 and $find2 are assumed to be defined
# elsewhere in the script (not shown).
my $mech = WWW::Mechanize->new;
my $Lvalue = "";
$mech->get($link);
$mech->quiet(1);
my $p = HTML::TokeParser->new(\$mech->content);
while ( my $tag1 = $p->get_tag('div') ) {
    if ( $tag1->[1]{class} and $tag1->[1]{class} eq 'moduleBody' ) {
        # get_trimmed_text returns text only; the tags in between are dropped.
        $Lvalue = $p->get_trimmed_text("moreLink");
        $Lvalue =~ s/$find1/|/g;
        $Lvalue =~ s/$find2/|/g;
        print $fh "$ticker^|$Lvalue\n";
    }
}
The output of the code above is:
In addition, ABC provides dual finishing and detailing products, including a system of cleaners, dressings, polishes, waxes and other products. Safety and Graphics Business The Safety and Graphics segment serves a range of markets for the safety, security and productivity of people, facilities and systems. Its product offerings include personal protection products, such as respiratory, hearing, eye and fall protection equipment;
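In effect, my code removes the HTML tags that I want to keep. I suspect get_trimmed_text needs to be adjusted so that it preserves <p>, </p>, <b>, </b> and the other HTML tags. Can anyone help with the necessary changes to the code?
To state the requirement clearly:
I am looking for a Perl function that extracts (TEXT + ALL HTML TAGS) located between <div class="moduleBody"> and <div class="moreLink"> on the web page (as in the sample input above). I am open to using functions other than get_trimmed_text.
Thank you very much.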
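Answer to this question, for the benefit of general readers
The reply provided by @SinanÜnür works well. Thank you, @SinanÜnür! +1, and marked as the answer.
For the benefit of general readers, note that Sinan Ünür's code works as-is as long as the HTML content is held in the heredoc variable introduced by "my $html = <<HTML;". If you are reading from a URL instead, the code needs a small adjustment to include the following: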
use LWP::Simple;
my $url = "http://www.example.com/profile?item=66&class=XYZ";
my $html = get($url);
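One caveat with the three lines above: LWP::Simple's get returns undef when the request fails, so it is worth checking the result before handing it to the parser. Below is a minimal sketch of that check. The helper name fetch_html is my own invention for illustration and is not part of the original answer; the URL is supplied on the command line rather than hard-coded.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical helper (not from the original answer): fetch a page and
# fail loudly, because LWP::Simple::get returns undef on any error.
sub fetch_html {
    my ($url) = @_;
    my $html = get($url);
    die "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only fetch when a URL is passed as the first command-line argument.
if (@ARGV) {
    my $html = fetch_html($ARGV[0]);
    print length($html), " characters fetched\n";
}
```

The returned string can then be passed by reference to the parser in the same way as the heredoc variable, for example HTML::TokeParser::Simple->new(\$html).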
Answer 0 (score: 1)
Answer updated following the update to the question.
The question asks: "I am looking for a Perl function that extracts (TEXT + ALL HTML TAGS) located between <div class="moduleBody"> and <div class="moreLink"> on the web page (as in the sample input above)."
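HTML::TokeParser is a stream parser: you ask it for tokens or for tags (a tag being a specific kind of token). So, with this module, you ask the parser to find the next div, check whether it has the right class and, if it does, start accumulating the content of every subsequent token until you reach the <div class="moreLink"> start tag.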
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
my $html = <<HTML;
<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;<div
class="moreLink">
HTML
my $p = HTML::TokeParser::Simple->new(\$html);
my $start = { tag => 'div', class => 'moduleBody' };
my $end = { tag => 'div', class => 'moreLink' };
while ( defined(my $chunk = extract_html_between($p, $start, $end)) ) {
    print "[[[$chunk]]]\n";
}
# Return the markup from the next matching $start div up to (but not
# including) the matching $end div, or undef once no start div remains.
sub extract_html_between {
    my ($p, $start, $end) = @_;

    while (my $tag = $p->get_tag($start->{tag})) {
        my $class = $tag->get_attr('class');
        next unless $class and $class eq $start->{class};

        my $chunk = $tag->as_is; # only if you want the opening div
        CHUNK:
        while (my $token = $p->get_token) {
            if ( $token->is_start_tag($end->{tag}) ) {
                $class = $token->get_attr('class');
                last CHUNK if $class and $class eq $end->{class};
            }
            $chunk .= $token->as_is;
        }
        # Return one chunk per call, so a page with several matching divs
        # yields each of them rather than only the last one.
        return $chunk;
    }
    return;
}
Output:
[[[<div class="moduleBody">In addition, <b>ABC provides</b>
dual finishing and detailing <u>products</u>, including a system of cleaners,
dressings, polishes, waxes and other products.</p><p></p><p>Safety and Graphics
Business</p><p></p><p>The Safety and Graphics segment serves a range of markets
for the safety, security and productivity of people, facilities and systems.
Its <b>product offerings</b> include personal protection products, such as
<u>respiratory, hearing, eye and fall protection</u> equipment;]]]
Answer 1 (score: 1)
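This is very strange code. You aren't using WWW::Mechanize for anything other than fetching the web page, so you may as well use LWP::UserAgent directly. Also, if you want to extract a part of an HTML resource and print it, HTML::TokeParser is not the right tool.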
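You don't seem to have read the documentation, because $p->get_trimmed_text("moreLink") returns all the text up to the first occurrence of a <moreLink> element, which is not a valid HTML tag. What you have passed is the value of the class attribute of a div element, not a tag name.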
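To make that point concrete, here is a small sketch of how get_trimmed_text treats its argument. The HTML string is a shortened, made-up stand-in for the page in the question, and text_until is a hypothetical helper added only for this illustration. With a real tag name the text stops at that tag; with "moreLink" no tag ever matches, so the text runs to the end of the input. In both cases only text comes back, never the tags.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Shortened, made-up stand-in for the page in the question.
my $html = '<div class="moduleBody">In addition, <b>ABC provides</b> '
         . 'dual finishing<div class="moreLink">more</div></div>';

# Hypothetical helper: the trimmed text that follows the first <div>,
# read up to the tag named $stop_tag.
sub text_until {
    my ($stop_tag) = @_;
    my $p = HTML::TokeParser->new(\$html);
    $p->get_tag('div');                       # skip past the opening div
    return $p->get_trimmed_text($stop_tag);
}

my $to_next_div  = text_until('div');       # a tag name: stops at the inner <div>
my $to_more_link = text_until('moreLink');  # not a tag name: nothing ever matches

print "$to_next_div\n";
print "$to_more_link\n";
```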
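I would choose Mojolicious for this, as it will fetch the page, build the DOM, and stringify the element you specify, without the need for any other modules.
I have written this, but I'm currently unable to test it.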
use strict;
use warnings 'all';
use Mojo::UserAgent;
use constant URL => 'http://example.com/';
my $ua = Mojo::UserAgent->new;
my $txn = $ua->get(URL);
if ( my $err = $txn->error ) {
    die "@{$err}{qw/ code message /}";
}
print $txn->res->dom->at('div.moduleBody')->to_string;
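The snippet above needs network access, so here is an offline sketch of the same DOM approach using Mojo::DOM on an inline string. The HTML is a shortened, made-up stand-in for the real page, and I am assuming the moreLink div is nested inside the moduleBody div, as it appears to be in the question's sample. Because to_string would otherwise include that nested div, the sketch removes it first; that removal step is my addition, not part of the answer's code.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Shortened, made-up stand-in for the page in the question.
my $html = '<div class="moduleBody">In addition, <b>ABC provides</b> dual '
         . '<u>products</u>.<div class="moreLink"><a href="#">more</a></div></div>';

my $dom  = Mojo::DOM->new($html);
my $body = $dom->at('div.moduleBody')
    or die "No div.moduleBody found\n";

# Assumption: the "more" link lives inside the module body, so remove it
# before stringifying.
$body->find('div.moreLink')->each(sub { $_->remove });

# The remaining markup, tags included.
print $body->to_string, "\n";
```

Checking the result of at before calling to_string avoids a runtime error when the page has no matching element.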