I'm not getting the HTML tags when parsing

Time: 2015-07-17 04:51:07

Tags: perl parsing web mechanize

The HTML snippet I want to parse looks like this:

<ul class="authors">
    <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
        <a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,
    </li>
    <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
        <a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">José L. Rodrigo</a>
    </li>

I want to extract the entire <a> element, but when I parse the page with WWW::Mechanize::TreeBuilder, all I get back is the author's name. So:

What I expect:

<a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,

<a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">José L. Rodrigo</a>

What I get:

Charles L. Fefferman,
José L. Rodrigo

Here is the code responsible for the parsing:

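use feature 'say';
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
# Add the HTML::TreeBuilder methods (look_down etc.) to the mech object
WWW::Mechanize::TreeBuilder->meta->apply($mech);
# $addressdio holds the page URL and is set elsewhere
$mech->get($addressdio);

# Every element with class="author", i.e. the <li> items
my @authors = $mech->look_down('class', 'author');

print "Authors: <br />";
foreach ( @authors ) {
    say $_->as_text(), "<br />";
}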

I think this may have something to do with as_text(), and also that when the CGI gets the HTML it doesn't treat it as text.

1 answer:

Answer 0 (score: 3)

I got it working, but in a completely different way, using HTML::TagParser:

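use feature 'say';
use HTML::TagParser;

my $html = HTML::TagParser->new("overwrite.xml");
my @li = $html->getElementsByAttribute('class', 'author');

foreach my $li (@li) {
    # The <a> element is the first child of each <li>
    # ($a is avoided as a name because sort() uses it)
    my $anchor = $li->firstChild();
    my $link = $anchor->getAttribute('href');
    say $li->innerText;   # text content of the <li> (the author's name)
    say $link;            # value of the href attribute
}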
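
For comparison, it should also be possible to stay with the HTML::Element methods that WWW::Mechanize::TreeBuilder adds to the mech object. as_text() returns only the text content of an element, which would explain why just the names were printed, while as_HTML() serializes the element with its tags. The sketch below is not from the original question or answer. It parses a local copy of the snippet with HTML::TreeBuilder instead of fetching a page, and the helper name author_links and the variable $sample are made up for illustration.

```perl
use strict;
use warnings;
use utf8;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical helper: return the serialized <a> element of every
# element with class="author" found in $html.
sub author_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @links;
    for my $li ( $tree->look_down( class => 'author' ) ) {
        # In scalar context look_down returns the first matching descendant.
        my $anchor = $li->look_down( _tag => 'a' ) or next;
        # '<>&' restricts entity-encoding to those three characters,
        # so a character such as "é" is left as it is.
        my $markup = $anchor->as_HTML('<>&');
        $markup =~ s/\s+\z//;    # drop any trailing whitespace
        push @links, $markup;
    }
    $tree->delete;               # free the parse tree
    return @links;
}

# Local copy of the snippet from the question (assumed input).
my $sample = <<'HTML';
<ul class="authors">
    <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
        <a href="/search?facet-creator=%22Charles+L.+Fefferman%22" itemprop="name">Charles L. Fefferman</a>,
    </li>
    <li class="author" itemprop="author" itemscope="itemscope" itemtype="http://schema.org/Person">
        <a href="/search?facet-creator=%22Jos%C3%A9+L.+Rodrigo%22" itemprop="name">José L. Rodrigo</a>
    </li>
</ul>
HTML

binmode STDOUT, ':encoding(UTF-8)';
say for author_links($sample);
```

With the code from the question, the same two look_down calls could presumably be made on $mech directly after get(). Note that if the result is printed into an HTML page from a CGI script, the browser will render the tags as a link. To show the markup literally it would have to be escaped first, for example with HTML::Entities::encode_entities.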