Using HTML::TreeBuilder::XPath

Posted: 2016-07-08 19:06:30

Tags: html perl xpath

I need to parse a large number of HTML documents. Here is a sample of the data so I can explain the problem better:

<div id="filerDiv">
    <div class="mailer">Mailing Address
        <span class="mailerAddress">65 MARKET STREET, SUITE 1207,</span>
        <span class="mailerAddress">CAMANA BAY, P.O. BOX 31110</span>
        <span class="mailerAddress">GRAND CAYMAN E9 KY1-1205</span>
    </div>
    <div class="mailer">Business Address
        <span class="mailerAddress">65 MARKET STREET, SUITE 1207,</span>
        <span class="mailerAddress">CAMANA BAY, P.O. BOX 31110</span>
        <span class="mailerAddress">GRAND CAYMAN E9 KY1-1205</span>
        <span class="mailerAddress">345 943 4573</span>
    </div>
    <div class="companyInfo">
        <span class="companyName">GREENLIGHT CAPITAL RE, LTD. (Filer)
            <acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001385613&amp;action=getcompany">0001385613 (see all company filings)</a></span>
        <p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>000000000</strong><br />Type: <strong>10-Q</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=001-33493&amp;action=getcompany"><strong>001-33493</strong></a> | Film No.: <strong>161612131</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=6331&amp;owner=include">6331</a></b> Fire, Marine &amp; Casualty Insurance<br />Assistant Director 1</p>
    </div>
</div>

I need to grab the four span elements inside the second div element with class mailer. This is the code I have so far:

my $root = HTML::TreeBuilder::XPath->new;
$root->parse($content);
my @Baddress = $root->findvalue('//div[@id="filerDiv"]/div[@class="mailer"][2]/span/text()');

But when I print the contents of @Baddress, the text of all the spans comes out on a single line, like this:

65 MARKET STREET, SUITE 1207,CAMANA BAY, P.O. BOX 31110 GRAND CAYMAN E9 KY1-1205 345 943 4573 

Everything is assigned to a single array element. I want each span assigned to its own array element so that I can parse them individually.

1 Answer:

Answer 0 (score: 0)

After slaving over this for a few hours, it turned out I was missing one essential piece. The code has to look like this:

my @Baddress = $root->findvalues('//div[@id="filerDiv"]/div[@class="mailer"][2]/span/text()');

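I had only written $root->findvalue, which returns the matched text as one concatenated string, so everything ended up in a single variable. findvalues (plural) returns a list with one value per matching node. Stupid mistake.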
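
For anyone who wants to try this end to end, here is a minimal sketch of the difference between the two calls. It assumes a trimmed copy of the sample markup from the question, and it uses parse_content rather than parse only so the parser is closed in one step. The variable names $joined and @lines are illustrative and not from the original code.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Trimmed copy of the sample markup from the question (assumption: the
# companyInfo block is left out because the XPath below does not touch it).
my $content = <<'HTML';
<div id="filerDiv">
    <div class="mailer">Mailing Address
        <span class="mailerAddress">65 MARKET STREET, SUITE 1207,</span>
        <span class="mailerAddress">CAMANA BAY, P.O. BOX 31110</span>
        <span class="mailerAddress">GRAND CAYMAN E9 KY1-1205</span>
    </div>
    <div class="mailer">Business Address
        <span class="mailerAddress">65 MARKET STREET, SUITE 1207,</span>
        <span class="mailerAddress">CAMANA BAY, P.O. BOX 31110</span>
        <span class="mailerAddress">GRAND CAYMAN E9 KY1-1205</span>
        <span class="mailerAddress">345 943 4573</span>
    </div>
</div>
HTML

my $root = HTML::TreeBuilder::XPath->new;
$root->parse_content($content);    # parses the string and closes the parser

my $xpath = '//div[@id="filerDiv"]/div[@class="mailer"][2]/span/text()';

# findvalue: one string containing the text of every matching node.
my $joined = $root->findvalue($xpath);

# findvalues: one item per matching node. The items are objects that
# stringify to their text, so force them to plain strings here.
my @lines = map { "$_" } $root->findvalues($xpath);

print scalar(@lines), " separate values\n";
print "$_\n" for @lines;

$root->delete;    # release the tree once it is no longer needed
```

Each entry of @lines can then be parsed on its own, for example by treating the last entry as the phone number.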