Imagine an HTML page that is a report with a repeating structure:
<html>
<body>
<h1>Big Hairy Report Page</h1>
<div class="customer">
<div class="customer_id">001</div>
<div class="customer_name">Joe Blough</div>
<div class="customer_addr">123 That Road</div>
<div class="customer_city">Smallville</div>
<div class="customer_state">Nebraska</div>
<div class="order_info">
<div class="shipping_details">
<ul>
<li>Large crate</li>
<li>Fragile</li>
<li>Express</li>
</ul>
</div>
<div class="order_item">Deluxe Hoodie</div>
<div class="payment">35.95</div>
<div class="order_id">000123456789</div>
</div>
<div class="comment">StackOverflow rocks!</div>
</div>
<div class="customer">
<div class="customer_id">002</div>
.... and so forth for a list of 150 customers
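Report pages like this come up a lot. My goal is to use HTML::TreeBuilder::XPath to extract each customer's information into some sensible data structure.
I know how to do the basics and read the file into $tree. But how can I concisely loop over that tree and get the related cluster of information for each customer? For example, how would I use this information to create a list of address labels sorted by customer number? What if I want to sort all of the customer information by state?
I'm not asking for the whole Perl program (I can read my file, write output to a file, and so on). I just need help understanding how to ask HTML::TreeBuilder::XPath for those related packets of data, and then how to dereference them. If it's easier to express this as an output statement (e.g. "Joe Blough ordered 1 Deluxe Hoodie and left 1 comment"), that's cool too.
Many thanks to anyone who tackles this; it seems a bit overwhelming to me.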
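Answer 0 (score: 3)
This should do what you need.
It first pulls all of the <div class="customer"> elements into the array @customers and then extracts the information from each one.
I have taken your example of address labels sorted by customer number (I assume you mean the class="customer_id" field). All of the address values are extracted from the array into the hash %customers, keyed first by customer ID and then by the element's class name. The information is then printed in order of ID.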
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_file('html.html');
# Select every top-level customer block.
my @customers = $tree->findnodes('/html/body/div[@class="customer"]');
# Build a hash of hashes: customer ID => { element class => text }.
my %customers;
for my $cust (@customers) {
    my $id = $cust->findvalue('div[@class="customer_id"]');
    for my $field (qw/ customer_name customer_addr customer_city customer_state /) {
        my $xpath = "div[\@class='$field']";
        my $val = $cust->findvalue($xpath);
        $customers{$id}{$field} = $val;
    }
}
# Print one address label per customer, in ID order.
for my $id (sort keys %customers) {
    my $info = $customers{$id};
    print "Customer ID $id\n";
    print $info->{customer_name}, "\n";
    print $info->{customer_addr}, "\n";
    print $info->{customer_city}, "\n";
    print $info->{customer_state}, "\n";
    print "\n";
}
Output
Customer ID 001
Joe Blough
123 That Road
Smallville
Nebraska
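The question also asks about ordering customers by state. Here is a minimal sketch, assuming a %customers hash with the same shape as the one built above (customer ID => { element class => text }). Only customer 001 comes from the report; customers 002 and 003 are invented sample data, and `ids_by_state` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample data shaped like %customers above.
# Records 002 and 003 are made up for illustration only.
my %customers = (
    '001' => {
        customer_name  => 'Joe Blough',
        customer_addr  => '123 That Road',
        customer_city  => 'Smallville',
        customer_state => 'Nebraska',
    },
    '002' => {
        customer_name  => 'Ann Example',
        customer_addr  => '9 Sample Street',
        customer_city  => 'Anytown',
        customer_state => 'Alabama',
    },
    '003' => {
        customer_name  => 'Bob Sample',
        customer_addr  => '42 Other Lane',
        customer_city  => 'Elsewhere',
        customer_state => 'Nebraska',
    },
);

# Hypothetical helper: return customer IDs ordered by state name,
# falling back to the ID itself when two customers share a state.
sub ids_by_state {
    my ($cust) = @_;
    return sort {
        $cust->{$a}{customer_state} cmp $cust->{$b}{customer_state}
            or $a cmp $b
    } keys %$cust;
}

for my $id (ids_by_state(\%customers)) {
    my $info = $customers{$id};
    print "$info->{customer_state}: $info->{customer_name} (ID $id)\n";
}
```

The comparison block is the only part that changes between orderings: swap `customer_state` for another field to sort on that instead.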
Answer 1 (score: 1)
I'll use XML::LibXML because it's faster and I'm familiar with it, but converting what I post from XML::LibXML to HTML::TreeBuilder::XPath should be very straightforward if you so desire.
use XML::LibXML qw( );
# Return the text of a node, or undef if no node was found.
sub get_text { defined($_[0]) ? $_[0]->textContent() : undef }
my $doc = XML::LibXML->load_html(...);
# Collect one hash reference per customer block.
my @customers;
for my $cust_node ($doc->findnodes('/html/body/div[@class="customer"]')) {
    my $id   = get_text( $cust_node->findnodes('div[@class="customer_id"]') );
    my $name = get_text( $cust_node->findnodes('div[@class="customer_name"]') );
    ...
    push @customers, {
        id   => $id,
        name => $name,
        ...
    };
}
Actually, given how regular the data is, you don't have to hard-code the field names.
use XML::LibXML qw( );
# Turn a <ul> into an array reference with one entry per <li>.
sub parse_list {
    my ($node) = @_;
    return [
        map parse_field($_),
            $node->findnodes('li')
    ];
}
# Turn an element into a string (leaf), an array reference (wraps a <ul>),
# or a hash reference keyed by the class of each child element.
sub parse_field {
    my ($node) = @_;
    my @children = $node->findnodes('*');
    return $node->textContent() if !@children;
    return parse_list($children[0]) if $children[0]->nodeName() eq 'ul';
    return {
        map { $_->getAttribute('class') => parse_field($_) }
            @children
    };
}
{
    my $doc = XML::LibXML->load_html( ... );
    my @customers =
        map parse_field($_),
            $doc->findnodes('/html/body/div[@class="customer"]');
    ...
}
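To make the dereferencing part concrete: for the sample report, each element of @customers should be a hash reference, with nested hash references for nested divs and an array reference for the <ul>. The literal below is my reading of what parse_field would return for the first customer, written by hand rather than captured from a run, and `summarize` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hand-written stand-in for one element of @customers (assumed shape).
my $customer = {
    customer_id    => '001',
    customer_name  => 'Joe Blough',
    customer_addr  => '123 That Road',
    customer_city  => 'Smallville',
    customer_state => 'Nebraska',
    order_info     => {
        shipping_details => [ 'Large crate', 'Fragile', 'Express' ],
        order_item       => 'Deluxe Hoodie',
        payment          => '35.95',
        order_id         => '000123456789',
    },
    comment => 'StackOverflow rocks!',
};

# Hypothetical helper: build a one-line summary from a customer record.
sub summarize {
    my ($c) = @_;
    my $order    = $c->{order_info};                         # nested hash reference
    my $shipping = join ', ', @{ $order->{shipping_details} };  # array reference
    return "$c->{customer_name} ordered $order->{order_item} "
         . "(shipping: $shipping) and commented: $c->{comment}";
}

print summarize($customer), "\n";
```

Each `->{...}` step descends one hash level, and `@{ ... }` expands an array reference into a list, so the access path mirrors the nesting of the divs in the page.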
Answer 2 (score: 1)
use HTML::TreeBuilder::XPath;
...
my @customers;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $mech->content() );
# Collect one hash reference per customer block.
foreach my $customer_section_node ( $tree->findnodes('//div[ @class = "customer" ]') ) {
    my $customer = {};
    $customer->{id}   = find_customer_id($customer_section_node);
    $customer->{name} = find_customer_name($customer_section_node);
    ...
    push @customers, $customer;
}
# Free the tree once the data has been copied out of it.
$tree->delete();
# Return the text of the customer_id div beneath the given node.
sub find_customer_id {
    my $node = shift;
    my ($id) = $node->findvalues('.//div[ @class = "customer_id" ]');
    return $id;
}