Question

So I am looking for a way to parse a product detail page, for example http://www.amazon.com/Code-Cloud-Pragmatic-Programmers-Chu-Carroll/dp/1934356638/ref=sr_1_1?ie=UTF8&qid=1359231803&sr=8-1&keywords=code+in+the+cloud, but I cannot get the right content out of the page.

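I have inspected the elements and found an id named "btAsinTitle", which should give me the title from an Amazon.com product detail page, but apparently nothing comes back in PHP. I also checked whether the element is loaded from an external resource, such as a JavaScript file pulled in on Amazon.com's side, and it does not appear to be (though I am not 100% sure). What I did was look at the loaded document: the document loaded at the exact URL I gave above contains the "btAsinTitle" id I am looking for.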
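The id lookup described above can be checked offline before involving the network. What follows is only a minimal sketch: the sample markup is a stand-in, not Amazon's real page, the helper name `extract_fields` is hypothetical, and the `examplePrice` id is invented for illustration. It uses `DOMXPath` to look up elements by id and returns `null` for anything it cannot find.

```php
<?php
/**
 * Returns the trimmed text of the first node matching each XPath query,
 * or null when nothing matches. (Hypothetical helper, not from the question.)
 */
function extract_fields(string $html, array $queries): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world HTML rarely validates cleanly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($queries as $field => $query) {
        $nodes = $xpath->query($query);
        $result[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : null;
    }
    return $result;
}

// Stand-in markup: only the btAsinTitle id comes from the question.
$sample = '<html><body><div>'
    . '<span id="btAsinTitle">Code in the Cloud</span>'
    . '</div></body></html>';

$fields = extract_fields($sample, [
    'title' => '//*[@id="btAsinTitle"]',
    'price' => '//*[@id="examplePrice"]', // assumed id, absent from the sample
]);
print_r($fields);
```

If the same query returns `null` against the HTML the script actually downloaded, the element is probably not in that response, so saving the raw response to a file and comparing it with what the browser shows may be a useful next step.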

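This is really just the first step of my small task of parsing the details. I also need a few other fields, including the author, price, and availability (whether the product is in stock). Below is the snippet I am trying to run.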

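Also, purely out of curiosity: what techniques can be used to prevent scraping, and is it possible that Amazon blocks its product pages from being scraped? Beyond that, I know I could use the API, but I am trying to follow the rules of the assignment, which means not using the API or registering an API key for it. Thanks in advance!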

class AmazonBook {
    protected $doc;

    public $url;
    public $title;
    public $author;
    public $price;
    public $availability;

    public function __construct($url) {
        $this->url = $url;

        $this->set_dom();
        // $this->set_availability();
        // $this->set_price();
        // $this->set_author();
        $this->set_title();
    }


    // Sets the title
    protected function set_title() {
        var_dump($this->doc->getElementById('btAsinTitle'));
        die();

        // foreach ($this->doc->getElementsByTagName('span') as $span) {
        //  var_dump($span->nodeValue);
        // }
        // die();
    }

    // Sets the DOM
    protected function set_dom() {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_HEADER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1');

        $this->doc = new DOMDocument();
        @$this->doc->loadHTML(curl_exec($ch));
    }
}

// Test code
$url = 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=code%20in%20the%20cloud';
$code_in_cloud = new AmazonBook($url);
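Two details in the snippet above may be worth ruling out. The test URL has an `/s/` path, which looks like a search results page, while the question describes a `/dp/` detail page. Separately, with `CURLOPT_HEADER` set to `true`, the string returned by `curl_exec` starts with the response headers, so they end up in the input to `loadHTML`. The sketch below shows one way to check both. The function names `detail_page_asin` and `strip_headers` are hypothetical, and the pattern assumes that a detail URL carries a 10-character uppercase alphanumeric product id right after `/dp/`.

```php
<?php
/**
 * Returns the product id from a URL whose path contains /dp/<id>,
 * or null when the URL does not look like a detail page.
 * Assumption: the id is 10 uppercase letters or digits.
 */
function detail_page_asin(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    return preg_match('#/dp/([A-Z0-9]{10})(?:/|$)#', $path, $m) === 1
        ? $m[1]
        : null;
}

/**
 * Drops the header block from a response captured with CURLOPT_HEADER on.
 * $headerSize is the value curl_getinfo($ch, CURLINFO_HEADER_SIZE) reports.
 */
function strip_headers(string $raw, int $headerSize): string
{
    return substr($raw, $headerSize);
}

$detail = 'http://www.amazon.com/Code-Cloud-Pragmatic-Programmers-Chu-Carroll/dp/1934356638/ref=sr_1_1?ie=UTF8&qid=1359231803&sr=8-1&keywords=code+in+the+cloud';
$search = 'http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=code%20in%20the%20cloud';

var_dump(detail_page_asin($detail)); // id found in the /dp/ segment
var_dump(detail_page_asin($search)); // no /dp/ segment in the path

// Simulated raw response; a real one would come from curl_exec().
$headers = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
$body = strip_headers($headers . '<html></html>', strlen($headers));
var_dump($body);
```

In the class itself, the simpler route would likely be to set `CURLOPT_HEADER` to `false` so that only the body reaches `loadHTML`; the header-stripping helper only matters if the headers are needed for something else.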

Parsing an Amazon detail page with PHP

0 answers: