How do I extract Amazon reviews from HTML?

Time: 2015-04-01 08:15:36

Tags: perl web-scraping

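I have been trying to write a Perl script that scrapes Amazon and downloads product reviews, but I cannot get it to work. I have been using the Perl modules LWP::Simple and HTML::TreeBuilder::XPath for this.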

For the HTML

<div id="revData-dpReviewsMostHelpfulAUI-R1GQHD9GMGBDXP" class="a-row a-spacing-small">
  <span class="a-size-mini a-color-state a-text-bold">
    Verified Purchase
  </span>
  <div class="a-section">
    I bought this to replace an earlier model that got lost in transit when we moved. It is a real handy helper to have when making tortillas. Follow the recipe for flour tortillas in the little recipe book that comes with it. I make a few changes

  </div>
</div>

</div>
</div>

I want to extract the product review. To do that, I wrote:

use LWP::Simple;

#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Assemble the URL from the passed ASIN.
my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);
die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('div[@class ="a-section"]');

foreach (@data)
{
    print "$_\n";
}

But I get no output. Can anyone point out my mistake?

3 Answers:

Answer 0: (score: 1)

I think the XPath should be '//div[@class ="a-section"]' (with an extra // at the start of the expression, so that the div is found anywhere in the HTML).
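
A minimal sketch of the difference, using a small inline HTML string in place of the live Amazon page. The markup and the variable names are assumptions made for illustration, not the real page structure.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Inline stand-in for the downloaded page; not real Amazon markup.
my $html = <<'HTML';
<html><body>
<div class="a-row"><div class="a-section">Great tortilla press.</div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Relative path: only considers <div> children of the context node (the root <html>).
my @relative = $tree->findvalues('div[@class="a-section"]');

# Descendant path: considers every <div> anywhere in the document.
my @anywhere = $tree->findvalues('//div[@class="a-section"]');

printf "relative: %d, anywhere: %d\n", scalar @relative, scalar @anywhere;
print "$_\n" for @anywhere;

$tree->delete;    # free the tree once finished with it
```

The relative expression finds nothing because the only children of the root element are head and body, which is consistent with the empty output described in the question.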

Answer 1: (score: 0)

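As choroba says, your XPath expression should start with // to find descendants of type div. As it stands, you are searching for <div> elements that are direct children of the document root, and there are none.

You also have the problem that the class attribute of each of these div elements contains multiple classes, like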

class="a-section a-subheader a-breadcrumb celwidget"

and you want a match when any one of them is a-section.

There are a few ways around this. The most obvious is to use the XPath contains function to check whether a-section appears anywhere in the class string, like this

use strict;
use warnings;
use feature 'say';    # needed for the say calls below

use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $asin = 'B0031EJBI4';

my $url = "http://amazon.com/o/tg/detail/-/$asin/?vi=customer-reviews";

my $tree = HTML::TreeBuilder::XPath->new->parse(get $url);

my @nodes = $tree->findnodes('//div[contains(@class, "a-section")]');

say scalar @nodes;

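That reports 60 such nodes in the page. That is the correct result, and you may not want to go any further, but the solution is not a safe one because it will also match nodes like

<div class="aaa-sections">

To fix this properly you need to fall back on the non-XPath HTML::Element method look_down, like this, which insists on a word boundary before and after a-section.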

my @nodes = $tree->look_down(
  _tag => 'div',
  class => qr/\ba-section\b/,
);

say scalar @nodes;

Again, the result is a correct 64.
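
To see what the word-boundary pattern accepts and rejects, here is a small sketch in plain Perl run over a few class strings. Only the first string comes from the page discussed above; the other two are made-up test cases.

```perl
use strict;
use warnings;

# The same pattern that is passed to look_down above.
my $re = qr/\ba-section\b/;

my @samples = (
    'a-section a-subheader a-breadcrumb celwidget',    # a genuine a-section class
    'aaa-sections',        # rejected: there is no word boundary inside "aaa"
    'a-section-header',    # accepted: "-" gives a boundary right after "section"
);

for my $class (@samples) {
    printf "%-45s %s\n", $class, $class =~ $re ? 'match' : 'no match';
}
```

The last sample suggests that \b is still looser than an exact class-token comparison, because a hyphen counts as a non-word character.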

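But even that solution does not allow for classes that begin or end with a non-word character, like -section. Because \b needs a word character on one side, /\b-section\b/ cannot match when -section is at the start of the class string or follows a space. The most general solution is to use a subroutine in the look_down criteria. It splits the class string on whitespace (' ' is correct: do not change it to / / or /\s+/) and builds a %classes hash that uses all the substrings as keys. The presence of the a-section class is then simply the value of $classes{'a-section'}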
@nodes = $tree->look_down(
  _tag => 'div',
  sub {
    return unless my $class = $_[0]->attr('class');
    my %classes = map { $_ => 1 } split ' ', $class;
    $classes{'a-section'};
  }
);

say scalar @nodes;

The result for this page is again 64, but this solution will work for any class string.
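
The remark about split ' ' can be checked with a short sketch in plain Perl. The class string below is a made-up example with leading and repeated spaces, and has_class is a hypothetical helper name, not part of HTML::Element.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $wanted is one of the whitespace-separated class tokens.
sub has_class {
    my ($class_attr, $wanted) = @_;
    return 0 unless defined $class_attr;
    my %classes = map { $_ => 1 } split ' ', $class_attr;
    return $classes{$wanted} ? 1 : 0;
}

my $messy = '  a-section   celwidget';    # leading and repeated spaces

# split ' ' skips leading whitespace and treats a run of whitespace as one separator.
my @awk_style = split ' ', $messy;        # ('a-section', 'celwidget')

# split / / breaks on every single space, so empty fields appear.
my @single = split / /, $messy;

# split /\s+/ collapses runs of whitespace but keeps one empty leading field.
my @regex = split /\s+/, $messy;

printf "' ': %d fields, / /: %d fields, /\\s+/: %d fields\n",
    scalar @awk_style, scalar @single, scalar @regex;

for my $attr ($messy, 'aaa-sections') {
    my $verdict = has_class($attr, 'a-section') ? 'has a-section' : 'does not have a-section';
    print "'$attr' $verdict\n";
}
```

The empty fields produced by the other two forms would end up as an empty-string key in %classes, which is harmless here but is the reason the plain ' ' form is the tidier choice.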

Answer 2: (score: -1)

use LWP::Simple;

#use HTML::TreeBuilder;
use HTML::TreeBuilder::XPath;

# Take the ASIN from the command line.
my $asin = shift @ARGV or die "Usage: perl get_reviews.pl <asin>\n";

# Use a fixed product page URL (the ASIN from the command line is not interpolated here).
my $url = "http://www.amazon.com/gp/product/B00R3DO58K/ref=s9_ri_gw_g74_i2?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-3&pf_rd_r=01F13XCKC1KBQAJ4EY87&pf_rd_t=36701&pf_rd_p=1970558902&pf_rd_i=desktop";

# Set up unescape-HTML rules. Quicker than URI::Escape.
my %unescape = ('&quot;'=>'"', '&amp;'=>'&', '&nbsp;'=>' ');
my $unescape_re = join '|' => keys %unescape;

# Request the URL.
my $content = get($url);



die "Could not retrieve $url" unless $content;
my $tree = HTML::TreeBuilder::XPath->new_from_content( $content);
my @data = $tree->findvalues('//span[@class="vtp-byline-text"]');


#print $content;

foreach (@data)
{
    print "$_\n";
}