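I have been trying to write a Perl script to scrape Amazon and download product reviews, but I haven't been able to get it working. I have been using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath to do this.
For the HTML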
<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
<span class="a-size-mini a-color-state a-text-bold">
Verified Purchase
</span>
<div class="a-section">
I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes
</div>
</div>
</div>
</div>
I want to extract the product review. To do that, I wrote:
use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('div[@class ="a-section"]');
foreach (@data)
{
print "$_\n";
}
But I get no output. Can anyone point out my mistake?
Answer 0 (score: 1)
I think the XPath should be '//div[@class ="a-section"]'
(the extra // at the start of the expression makes it find the div anywhere in the HTML).
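Answer 1 (score: 0)
As choroba says, your XPath expression should start with // so that it finds div descendants. As it stands, you are searching for <div> elements at the root of the document, and there are none.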
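To see the difference without hitting Amazon, here is a minimal sketch that runs both expressions against a small inline document. The markup is made up for illustration and is not the real page, and it assumes HTML::TreeBuilder::XPath is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up markup standing in for the downloaded page (an assumption,
# not the real Amazon HTML).
my $html = <<'HTML';
<html><body>
<div class="a-row"><div class="a-section">Review text</div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only considers direct children of the context node,
# which here is the <html> element.
my @relative = $tree->findvalues('div[@class="a-section"]');

# Descendant path: considers every div anywhere in the document.
my @anywhere = $tree->findvalues('//div[@class="a-section"]');

printf "relative: %d match(es)\n", scalar @relative;
printf "anywhere: %d match(es)\n", scalar @anywhere;

# Free the tree explicitly once the values have been extracted.
$tree->delete;
```

The only change between the two calls is the leading //, which is the same change the question's findvalues call needs.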
In practice, the class attribute of each div element also contains multiple classes, like
class="a-section a-subheader a-breadcrumb celwidget"
and you want any one of them to be a-section.
There are several ways to solve this. The most obvious is to use XPath contains to check whether a-section appears anywhere in the class string, like this
use strict;
use warnings;
use feature 'say';
use LWP::Simple;
use HTML::TreeBuilder::XPath;
my $asin = 'B0031EJBI4';
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";
my $tree = HTML::TreeBuilder::XPath->new->parse(get $url);
my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');
say scalar @nodes;
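This reports 60 such nodes in the page. That is the correct result, and you may not want to go any further, but the solution isn't safe because it will also match nodes like
<div class="aaa-sections">
To solve this properly you need to fall back on the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.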
my @nodes = $tree->look_down(
_tag => 'div',
class => qr/\ba-section\b/,
);
say scalar @nodes;
Again, the result is a correct 64.
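But even that solution doesn't allow for classes that start or end with a non-word character, like -section, because /\b-section\b/ will never match such a class. The most general solution is to use a subroutine in the look_down criteria, so that you can split the class string on whitespace (' ' is correct: don't change it to / / or /\s+/) and build a %classes hash with all of the substrings as keys. The presence of the a-section class is then just $classes{'a-section'}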
@nodes = $tree->look_down(
_tag => 'div',
sub {
return unless my $class = $_[0]->attr('class');
my %classes = map { $_ => 1 } split ' ', $class;
$classes{'a-section'};
}
);
say scalar @nodes;
The result for this page is again 64, but this solution will work for any class string.
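As a rough illustration of why the token-based test is the most robust, this sketch compares the three strategies on a few made-up class strings using only core Perl. has_class is a hypothetical helper name, not part of HTML::Element, and index stands in here for what XPath contains does.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $wanted is one of the whitespace-separated
# tokens in the class attribute value $class.
sub has_class {
    my ($class, $wanted) = @_;
    return 0 unless defined $class;
    my %classes = map { $_ => 1 } split ' ', $class;
    return $classes{$wanted} ? 1 : 0;
}

# Made-up class strings for illustration only.
my @samples = (
    'a-section a-subheader celwidget',
    'aaa-sections',
    '  a-section   a-spacing-small ',
    '-section',
);

for my $class (@samples) {
    my $substring = index($class, 'a-section') >= 0 ? 1 : 0;  # substring test
    my $boundary  = $class =~ /\ba-section\b/       ? 1 : 0;  # word-boundary regex
    my $token     = has_class($class, 'a-section');           # exact token match
    printf "%-36s substring=%d boundary=%d token=%d\n",
        "'$class'", $substring, $boundary, $token;
}
```

The third sample has leading and repeated spaces on purpose: split ' ' skips leading whitespace and collapses runs of it, whereas / / or /\s+/ would produce empty fields for that string.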
Answer 2 (score: -1)
use LWP::Simple;
#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;
# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";
# Assemble the URL from the passed ASIN.
my $url = "http://www.amazon.com/gp/product/B00R3DO58K/ref=s9_ri_gw_g74_i2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-3&pf_rd_r=01F13XCKC1KBQAJ4EY87&pf_rd_t=36701&pf_rd_p=1970558902&pf_rd_i=desktop";
# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;
# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]');
#print $content;
foreach (@data)
{
print "$_\n";
}