用PHP抓取Instagram-插件Wordpress

时间:2019-02-28 13:30:47

标签: php wordpress web-scraping instagram

我正在尝试从没有访问令牌的Instagram中检索用户的图片。为此,我使用PHP构建了一个scraper。 它工作得很好,但有时并且仅在某些帐户下不起作用。

这里有抓取Instagram的功能:

function get_instagram_feed( $number, $username ) {
        error_reporting(0);

        require 'simple-cache.php';
        $cacheFolder = 'instagram-cache';

        $user = strtolower( $username );
        if (!file_exists($cacheFolder)) {
            mkdir($cacheFolder, 0777, true);
        }
        $cache = new Gilbitron\Util\SimpleCache();
        $cache->cache_path = $cacheFolder . '/';
        $cache->cache_time = 3600;
        $scraped_website = $cache->get_data("user-$user", "https://www.instagram.com/$user/");
        $document = new DOMDocument();
        libxml_use_internal_errors(true);
        $document->loadHTML($scraped_website);
        libxml_use_internal_errors(false);
        $selector = new DOMXPath($document);
        $anchors = $selector->query('/html/body//script');

        $images = array();
        $insta_feed = array();

        foreach($anchors as $a) {
            $text = $a->nodeValue;
            preg_match('/window._sharedData = \{(.*?)\};/', $text, $matches);
            $json = json_decode('{' . $matches[1] . '}', true);
            $images[] = $json['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'];
        }

        for ( $i = 0; $i < count($images); $i++ ) {
            $insta_feed[] = array(
                'thumbnail' => $images[0][$i]['node']['thumbnail_resources'][0]['src'],
                'small' => $images[0][$i]['node']['thumbnail_resources'][1]['src'],
                'medium' => $images[0][$i]['node']['thumbnail_resources'][2]['src'],
                'large' => $images[0][$i]['node']['thumbnail_resources'][4]['src'],
                'original' => $images[0][$i]['node']['display_url'],
                'link'  => trailingslashit( '//instagram.com/p/' . $images[0][$i]['node']['shortcode'] ),
                'caption' => $images[0][$i]['node']['edge_media_to_caption']['edges'][0]['node']['text']
            );
        }

        if ( !empty( $insta_feed ) ) {
            return ( $number ) ? array_slice( $insta_feed, 0, $number ) : $insta_feed;
        }
    }

工作正常时,在foreach中,我有一个$aDOMElement Object,我可以浏览该图像以获取图像URL。

当它不起作用时,我的$a如下所示:

DOMElement Object ( [tagName] => script [schemaTypeInfo] => [nodeName] => script [nodeValue] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); [nodeType] => 1 [parentNode] => (object value omitted) [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => (object value omitted) [nextSibling] => [attributes] => (object value omitted) [ownerDocument] => (object value omitted) [namespaceURI] => [prefix] => [localName] => script [baseURI] => [textContent] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); )

我真的不明白。形式localhost的运作方式始终像是一种魅力。在一个具有某些帐户的实时网站上,我得到了DOMElement Object,但没有任何有趣的数据。 这种情况大多发生在“经过验证的”帐户中。

有人可以帮我解决这个小挑战吗?

0 个答案:

没有答案