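How do I parse the meta tags of a given URL to get the image and description?

Time: 2019-03-24 08:03:53

Tags: php meta-tags

I am trying to get the image and other data from meta tags. Could you guide me on how to get the image from a specific URL?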

Example URLs:

  1. https://www.myntra.com/casual-shoes/kook-n-keech/kook-n-keech-men-white-sneakers/2154180/buy

  2. https://www.amazon.in/Redmi-Pro-Black-32GB-Storage/dp/B07DJL15QT/ref=lp_16113280031_1_1?srs=16113280031&ie=UTF8&qid=1553411505&sr=8-1

  3. https://www.flipkart.com/asian-wndr-13-training-shoes-walking-shoes-gym-shoes-sports-shoes-running-men/p/itmfatksqm2wzfw8?pid=SHOF3KF5XZZHCMBD&lid=LSTSHOF3KF5XZZHCMBDS561HI&marketplace=FLIPKART&spotlightTagId=BestsellerId_osp%2Fcil&srno=b_1_1&otracker=hp_omu_Deals%2Bof%2Bthe%2BDay_2_XI7YOJ4F5LAF_0&otracker1=hp_omu_PINNED_neo%2Fmerchandising_Deals%2Bof%2Bthe%2BDay_NA_dealCard_cc_2_NA_0&fm=neo%2Fmerchandising&iid=11b0262a-d573-4a8d-9938-55051f6474c9.SHOF3KF5XZZHCMBD.SEARCH&ppt=StoreBrowse&ppn=Store&ssid=gvvzlooffk0000001553411768922

Code:

    function getUrlData($url) {
        $result = false;

        $contents = getUrlContents($url);

        if (isset($contents) && is_string($contents)) {
            $title = null;
            $metaTags = null;

            preg_match('/<title>([^>]*)<\/title>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) > 0) {
                $title = strip_tags($match[1]);
            }

            preg_match_all('/<[\s]*meta[\s]*name="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) == 3) {
                $originals = $match[0];
                $names = $match[1];
                $values = $match[2];

                if (count($originals) == count($names) && count($names) == count($values)) {
                    $metaTags = array();

                    for ($i = 0, $limiti = count($names); $i < $limiti; $i++) {
                        $metaTags[$names[$i]] = array(
                            'html' => htmlentities($originals[$i]),
                            'value' => $values[$i]
                        );
                    }
                }
            }

            $result = array(
                'title' => $title,
                'metaTags' => $metaTags
            );
        }
        return $result;
    }


    function getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0) {
        $result = false;

        $contents = @file_get_contents($url);


        // Check if we need to go somewhere else

        if (isset($contents) && is_string($contents)) {
            preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);

            if (isset($match) && is_array($match) && count($match) == 2 && count($match[1]) == 1) {
                if (!isset($maximumRedirections) || $currentRedirection < $maximumRedirections) {
                    return getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
                }

                $result = false;
            } else {
                $result = $contents;
            }
        }

        return $result;
    }

    $test = getUrlData('https://www.amazon.in/Redmi-Pro-Black-32GB-Storage/dp/B07DJL15QT/ref=lp_16113280031_1_1?srs=16113280031&ie=UTF8&qid=1553411505&sr=8-1'); // Replace with your URL here

    echo '<pre>';
    print_r($test);

  • Result for the 1st URL: blank
  • Result for the 2nd URL: 2nd URL
  • Result for the 3rd URL: 3rd URL

I am unable to find the image data for this URL or for the first URL.

1 Answer:

Answer 0 (score: 0)

Use DOMDocument and DOMXPath to parse the HTML retrieved from the given URL:

    function outputMetaTags($url) {
        // $url = 'https://www.myntra.com/casual-shoes/kook-n-keech/kook-n-keech-men-white-sneakers/2154180/buy';
        // Act as a browser, in case the server refuses requests that carry no User-Agent
        $streamContext = stream_context_create(array(
            'http' => array(
                'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
                'follow_location' => false
            )
        ));

        $htmlData = file_get_contents($url, false, $streamContext); // fetch the HTML from the given URL
        if ($htmlData === false || $htmlData === '') {
            return; // the request failed or returned nothing, so there is nothing to parse
        }

        libxml_use_internal_errors(true); // collect libxml warnings about malformed HTML instead of printing them
        $doc = new DOMDocument();
        $doc->loadHTML($htmlData); // parse the HTML into a DOM tree
        $xpath = new DOMXPath($doc); // run XPath queries against the loaded DOM
        $query = '//*/meta';

        $metaData = $xpath->query($query);
        foreach ($metaData as $singleMeta) {
            // for og:image, check if $singleMeta->getAttribute('property') === 'og:image'; the same goes for og:url
            // not every meta tag has a property or name attribute
            if (!empty($singleMeta->getAttribute('property'))) {
                echo $singleMeta->getAttribute('property') . "\n";
            } elseif (!empty($singleMeta->getAttribute('name'))) {
                echo $singleMeta->getAttribute('name') . "\n";
            }
            // get the content of the meta tag
            echo $singleMeta->getAttribute('content') . "\n";
        }
    }
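
To pick out just the image and description instead of printing every tag, the same DOMDocument/DOMXPath approach can fill an array keyed by the `property` or `name` attribute. The following is only a minimal sketch: `extractMetaTags` is a hypothetical helper name, the sample markup is made up so the snippet runs without a network request, and it assumes the page exposes `og:image` and a `description` (or `og:description`) tag, which not every site does.

```php
<?php
// Hypothetical helper (not part of the answer above): collects meta tags from
// an HTML string into an associative array keyed by "property" or "name".
function extractMetaTags(string $html): array
{
    // Collect libxml warnings internally so malformed markup does not print notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    $xpath = new DOMXPath($doc);
    // Only meta elements that carry a content attribute are of interest here.
    foreach ($xpath->query('//meta[@content]') as $meta) {
        // Open Graph tags use "property"; classic tags such as description use "name".
        $key = $meta->getAttribute('property') ?: $meta->getAttribute('name');
        if ($key !== '' && !isset($tags[$key])) {
            $tags[$key] = $meta->getAttribute('content'); // keep the first occurrence
        }
    }
    return $tags;
}

// Made-up sample markup standing in for the HTML of a fetched product page.
$sampleHtml = '<html><head>'
    . '<title>Sample product</title>'
    . '<meta charset="utf-8">'
    . '<meta name="description" content="A pair of white sneakers">'
    . '<meta property="og:image" content="https://example.com/images/sneakers.jpg">'
    . '</head><body></body></html>';

$tags = extractMetaTags($sampleHtml);

// Fall back from og:description to the plain description tag if needed.
$image = $tags['og:image'] ?? null;
$description = $tags['og:description'] ?? $tags['description'] ?? null;

echo "Image: $image\n";
echo "Description: $description\n";
```

In real use, the string passed to `extractMetaTags` would be the `$htmlData` fetched with the stream context shown above. If a site builds its meta tags with JavaScript, they will not be present in the raw HTML and this approach will return nothing for them.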

Read more about DOMDocument and DOMXPath:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

About meta tags:

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta