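PHP - How do I use preg_match_all to get the src of img tags with a specific class name?

Asked: 2019-05-16 19:14:24

Tags: php html regex web-scraping

I am trying to build a scraper for an Amazon product search listing page.

Function: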

function getHTMLcode($url) {

    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
    curl_setopt($curl, CURLOPT_ENCODING, 'identity');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);

    return $html;

}

Function call:

  $url="http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";

  $html= getHTMLcode($url);
  $image = '/src="(?P<img>[^"]*)"/';  
  preg_match_all($image,$html,$data);
  var_dump($data);

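Problem: this returns every src attribute on the page. I only need the ones on product images with class="s-image", and the h2 (product title) and price tags are not returned.

Question: how can I get only the images with a specific class name, along with the title and price tags, from an Amazon product search listing? Amazon returns the following markup: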

<img src="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg" class="s-image" alt="Apple iPhone Xs Max with FaceTime - 256GB, 4G LTE, Space Gray" srcset="https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL436_.jpg 1x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL654_FMwebp_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL872_FMwebp_QL65_.jpg 2x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1090_FMwebp_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/61QSgY4zXNL._AC_UL1308_FMwebp_QL65_.jpg 3x" data-image-index="0" data-image-load="" data-image-latency="s-product-image" data-image-source-density="1">

Similarly, to get the product title and price, I am trying:

  $title = '/<h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2">(?P<val>[^>]*)<\/h2>/';
  preg_match_all($title,$html,$value);
  var_dump($value);
  $price = '/<span class="a-price-whole"><span class="a-price-symbol">&nbsp;&nbsp;<\/span>(?P<price>[^>]*)<\/span>/';
  preg_match_all($price,$html,$cost);
  var_dump($cost);

1 answer:

Answer 0 (score: 3)

You are using the wrong tool. You should use an HTML parser for this, with an XPath query to find what you need:

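<?php
$url = "http://www.amazon.com/s/?url=search-alias%3Daps&field-keywords=iphone";
$html = getHTMLcode($url); // getHTMLcode() is the cURL helper from the question
$dom = new DOMDocument();
// Pass true so libxml collects parse warnings internally instead of emitting them
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select the src attribute of every img whose class attribute contains "s-image"
$nodes = $xpath->query("//img[contains(@class, 's-image')]/@src");
$data = [];
foreach ($nodes as $node) {
    $data[] = $node->textContent;
}
print_r($data);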

Output:

Array
(
    [0] => https://m.media-amazon.com/images/I/418H4DiygbL._AC_UL436_.jpg
    [1] => https://m.media-amazon.com/images/I/61IzJCh8i8L._AC_UL436_.jpg
    [2] => https://m.media-amazon.com/images/I/71RYhD1uzpL._AC_UL436_.jpg
    [3] => https://m.media-amazon.com/images/I/41jUosGQiDL._AC_UL436_.jpg
    [4] => https://m.media-amazon.com/images/I/51CBPR-l2VL._AC_UL436_.jpg
    [5] => https://m.media-amazon.com/images/I/813nLXVhnwL._AC_UL436_.jpg
    [6] => https://m.media-amazon.com/images/I/61WpoMEdpoL._AC_UL436_.jpg
    [7] => https://m.media-amazon.com/images/I/913VoEdo-4L._AC_UL436_.jpg
    [8] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
    [9] => https://m.media-amazon.com/images/I/81s7ZLOGOWL._AC_UL436_.jpg
    [10] => https://m.media-amazon.com/images/I/513R4aVg1cL._AC_UL436_.jpg
    [11] => https://m.media-amazon.com/images/I/51BbI-8wpTL._AC_UL436_.jpg
    [12] => https://m.media-amazon.com/images/I/61pRPj+-IYL._AC_UL436_.jpg
    [13] => https://m.media-amazon.com/images/I/71x3e0x+M2L._AC_UL436_.jpg
    [14] => https://m.media-amazon.com/images/I/6165FLUs1+L._AC_UL436_.jpg
    [15] => https://m.media-amazon.com/images/I/81ZJNQZBFCL._AC_UL436_.jpg
    [16] => https://m.media-amazon.com/images/I/51sTR66B1UL._AC_UL436_.jpg
    [17] => https://m.media-amazon.com/images/I/71QxMMTKiVL._AC_UL436_.jpg
    [18] => https://m.media-amazon.com/images/I/61OUrdtiDcL._AC_UL436_.jpg
    [19] => https://m.media-amazon.com/images/I/71ktNlpWWdL._AC_UL436_.jpg
    [20] => https://m.media-amazon.com/images/I/51x3FM83EQL._AC_UL436_.jpg
    [21] => https://m.media-amazon.com/images/I/41-Mv2nSrNL._AC_UL436_.jpg
)