Page-scraping script suddenly stopped working

Asked: 2020-05-24 07:24:29

Tags: php web-scraping

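I recently removed a script I wrote a few months ago to scrape some basic data from Amazon (I now have access to the API, so I no longer need it), but it still bugs me that I can't see what is wrong with the script.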

<?php

# Example URLs:
# https://www.amazon.co.uk/s?k=22+Bdc+Scope&ref=nb_sb_noss
# https://www.amazon.co.uk/s?k=Angled+Ceiling+Speaker&ref=nb_sb_noss

$url  = "https://www.amazon.co.uk/s?k=Angled+Ceiling+Speaker&ref=nb_sb_noss";       
$html = file_get_contents($url);    

// The function returns an array, so print it with print_r() rather than echo.
print_r(parseHtmlAmazonScraper($html));

function parseHtmlAmazonScraper($html) {
    try {
        libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $xpath = new DomXPath($doc);
        $nodeList = $xpath->query("//div[@class='sg-col-4-of-24 sg-col-4-of-12 sg-col-4-of-36 sg-col-4-of-28 sg-col-4-of-16 sg-col sg-col-4-of-20 sg-col-4-of-32']");
        if (sizeof($nodeList) == 0) {
            $nodeList = $xpath->query("//div[@class='sg-col-4-of-12 sg-col-8-of-16 sg-col-16-of-24 sg-col-12-of-20 sg-col-24-of-32 sg-col sg-col-28-of-36 sg-col-20-of-28']");
        }
        $res = [];

        foreach ($nodeList as $node) {
            $new = new DomDocument;
            $new->appendChild($new->importNode($node, true));
            $N = new DomXPath($new);            
            $nodeImg  = $N->query("//img[@class='s-image']")->item(0);          
            $Img      = $nodeImg->getAttribute('src');          
            $nodeLink = $N->query("//a[@class='a-link-normal a-text-normal']")->item(0);
            $Path     = $nodeLink->getAttribute('href');
            $Name     = trim($nodeLink->textContent);       
            $res[] = [
                'productLink' => $Path,
                'productDescription' => $Name,
                'productImage' => $Img
            ];
        }
        return $res;
    } catch(Exception $e) {
        echo $e->getMessage();      
    }
}

?>

It worked fine when I tested it a few months ago, and when I inspect the HTML structure I can't see any obvious changes. What I get is:

Fatal error: Uncaught Error: Call to a member function getAttribute() on null

So when I var_dump() $nodeList, I basically get back:

object(DOMNodeList)[47]
  public 'length' => int 44
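
Since the outer query does return nodes, the null has to come from one of the two inner ->item(0) calls, which return null when the inner query matches nothing for that container. A minimal sketch for finding out which containers lack which element is shown here. The sample markup, the 'result' class and the findIncompleteNodes() helper are all hypothetical stand-ins, not anything taken from the live page.

```php
<?php
// Hypothetical sample markup standing in for a results page; the second
// container deliberately has no <img class="s-image">.
$sampleHtml = <<<HTML
<html><body>
<div class="result"><img class="s-image" src="a.jpg"><a class="a-link-normal a-text-normal" href="/a">Item A</a></div>
<div class="result"><a class="a-link-normal a-text-normal" href="/b">Item B</a></div>
</body></html>
HTML;

// Hypothetical helper: for each container matched by $containerQuery, record
// which of the expected child elements (image, link) could not be found.
function findIncompleteNodes(string $html, string $containerQuery): array {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $containers = $xpath->query($containerQuery);
    $missing = [];
    for ($i = 0; $i < $containers->length; $i++) {
        $node = $containers->item($i);
        $problems = [];
        // A leading "." keeps the query relative to the context node.
        if ($xpath->query(".//img[@class='s-image']", $node)->item(0) === null) {
            $problems[] = 'image';
        }
        if ($xpath->query(".//a[@class='a-link-normal a-text-normal']", $node)->item(0) === null) {
            $problems[] = 'link';
        }
        if ($problems !== []) {
            $missing[$i] = $problems;
        }
    }
    return $missing;
}

print_r(findIncompleteNodes($sampleHtml, "//div[@class='result']"));
```

Running the same kind of check against the real $html would show whether every one of the 44 containers is incomplete or only a few of them, which narrows down where the markup differs from what the script expects.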

The HTML structure of the test page looks fine to me. Am I missing something obvious here?

Any help in the right direction would be much appreciated; I generally like to understand why things break.
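
One thing that may be worth ruling out: a comparison such as @class='a-link-normal a-text-normal' only matches when the attribute is exactly that string, so an added, removed or reordered class name makes the inner query return nothing. The sketch below uses the common XPath 1.0 idiom for matching a single class token and skips containers where either element is missing. The hasClass() and extractProducts() helpers and the sample markup are hypothetical, and whether these class names still exist on the live page is an assumption this sketch cannot confirm.

```php
<?php
// Hypothetical helper: builds an XPath 1.0 predicate that matches one class
// token regardless of which other classes the attribute contains.
function hasClass(string $token): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $token ')";
}

// Hypothetical markup: the extra and reordered classes in the first container
// would defeat an exact @class='...' comparison.
$sample = <<<HTML
<html><body>
<div class="sg-col extra"><img class="s-image lazy" src="a.jpg"><a class="a-text-normal a-link-normal" href="/a"> Item A </a></div>
<div class="sg-col"><a class="a-link-normal a-text-normal" href="/b">Item B</a></div>
</body></html>
HTML;

// Hypothetical helper: collects link, description and image per container,
// skipping any container that lacks the image or the link.
function extractProducts(string $html): array {
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $products = [];
    foreach ($xpath->query('//div[' . hasClass('sg-col') . ']') as $node) {
        $img  = $xpath->query('.//img[' . hasClass('s-image') . ']', $node)->item(0);
        $link = $xpath->query('.//a[' . hasClass('a-link-normal') . ']', $node)->item(0);
        if ($img === null || $link === null) {
            continue; // avoids calling getAttribute() on null
        }
        $products[] = [
            'productLink'        => $link->getAttribute('href'),
            'productDescription' => trim($link->textContent),
            'productImage'       => $img->getAttribute('src'),
        ];
    }
    return $products;
}

print_r(extractProducts($sample));
```

The sample assumes the containers are not nested inside one another; if they are nested on a real page, the outer query would need to be narrower to avoid collecting the same product twice.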

0 answers:

No answers yet