Scraping a column of table#id with Web::Scraper

Date: 2014-07-21 20:52:16

Tags: perl

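I have an HTML page with the following structure:

  • it has a table with id="searchResult"
  • the table has many rows
  • each row contains 3 td cells, none of which has a class
  • each cell contains one link (a href); I need the links from the second cell (column)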

I tried various XPath expressions with the scraper, for example:

my $links = scraper {
    process '//table[id="searchResult"]', "lines[]" => scraper {
        process "//tr/td[2]/a", text => 'TEXT';
        process "//tr/td[2]/a", link => '@href';
    };
};
my $res = $links->scrape($html);

But it doesn't work, and $res comes back empty: {}

In case anyone needs it, here is the complete test code:

use 5.014;
use warnings;

use Web::Scraper;
use Data::Dumper;

my $links = scraper {
    process '//table[id="searchResult"]', "lines[]" => scraper {
        process "//tr/td[2]/a", text => 'TEXT';
        process "//tr/td[2]/a", link => '@href';
    };
};

my $html = do {local $/;<DATA>};
#say $html;

my $res = $links->scrape($html);
say Dumper $res;

__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
    <thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
    <tr>
    <td><a href="#11">cell11</a></td>
    <td><a href="#12">cell12</a></td>
    <td><a href="#13">cell13</a></td>
    </tr>
    <tr>
    <td><a href="#21">cell21</a></td>
    <td><a href="#22">cell22</a></td>
    <td><a href="#23">cell23</a></td>
    </tr>
    <tr>
    <td><a href="#31">cell31</a></td>
    <td><a href="#32">cell32</a></td>
    <td><a href="#33">cell33</a></td>
    </tr>
</table>
</body>
</html>

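1 answer:

Answer 0: (score: 3)

My go-to scraper for this type of project is Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.

A CSS Selector Reference may also come in handy.

The following performs the parsing you were attempting, using this module: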

use strict;
use warnings;

use Mojo::DOM;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $link ($dom->find('table[id=searchResult] > tr > td:nth-child(2) > a')->each) {
    print $link->{href}, " - ", $link->text, "\n";
}

__DATA__
<html>
<body>
<p>...</p>
<table id="searchResult">
    <thead><th>x</th><th>x</th><th>x</th><th>x</th><th>x</th></thead>
    <tr>
    <td><a href="#11">cell11</a></td>
    <td><a href="#12">cell12</a></td>
    <td><a href="#13">cell13</a></td>
    </tr>
    <tr>
    <td><a href="#21">cell21</a></td>
    <td><a href="#22">cell22</a></td>
    <td><a href="#23">cell23</a></td>
    </tr>
    <tr>
    <td><a href="#31">cell31</a></td>
    <td><a href="#32">cell32</a></td>
    <td><a href="#33">cell33</a></td>
    </tr>
</table>
</body>
</html>

Output:

#12 - cell12
#22 - cell22
#32 - cell32
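
For anyone who would rather stay with Web::Scraper, the following is a minimal sketch of how the original attempt could probably be repaired; it is not part of the answer above. In XPath, the predicate [id="searchResult"] tests a child element named id, whereas [@id="searchResult"] tests the attribute, which would explain the empty result. The sketch also assumes that it is more convenient to collect one entry per row, so the outer expression selects the rows (tr[td] is my own addition to skip header rows) rather than the table. The markup is trimmed to two rows for brevity.

```perl
use strict;
use warnings;

use Web::Scraper;

# Trimmed-down version of the sample page: one link per cell.
my $html = <<'HTML';
<html><body>
<table id="searchResult">
    <tr>
    <td><a href="#11">cell11</a></td>
    <td><a href="#12">cell12</a></td>
    <td><a href="#13">cell13</a></td>
    </tr>
    <tr>
    <td><a href="#21">cell21</a></td>
    <td><a href="#22">cell22</a></td>
    <td><a href="#23">cell23</a></td>
    </tr>
</table>
</body></html>
HTML

my $links = scraper {
    # @id tests the attribute; a bare id would look for a child <id> element.
    # tr[td] keeps only rows that have at least one <td> child.
    process '//table[@id="searchResult"]//tr[td]', 'lines[]' => scraper {
        # Inside the nested scraper each matched row is the root of its own tree.
        process '//td[2]/a', text => 'TEXT';
        process '//td[2]/a', link => '@href';
    };
};

my $res = $links->scrape($html);

# Print one "href - text" pair per row, mirroring the Mojo::DOM example.
for my $line (@{ $res->{lines} || [] }) {
    print "$line->{link} - $line->{text}\n";
}
```

Selecting rows in the outer process matters because a non-array key such as text keeps only the first match; if the outer expression matched the table, lines would hold a single entry for the first row.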