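I am trying to retrieve a user's images from Instagram without an access token. To do this, I built a scraper in PHP. It works well, but sometimes, and only for certain accounts, it does not work.
Here is the function that scrapes Instagram: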
function get_instagram_feed( $number, $username ) {
    error_reporting(0); // note: this silences every PHP warning, including failed fetches
    require_once 'simple-cache.php'; // require_once, so a second call does not redeclare the class

    $cacheFolder = 'instagram-cache';
    $user = strtolower( $username );

    if ( !file_exists( $cacheFolder ) ) {
        mkdir( $cacheFolder, 0777, true );
    }

    // Fetch the profile page, cached for one hour
    $cache = new Gilbitron\Util\SimpleCache();
    $cache->cache_path = $cacheFolder . '/';
    $cache->cache_time = 3600;
    $scraped_website = $cache->get_data( "user-$user", "https://www.instagram.com/$user/" );

    $document = new DOMDocument();
    libxml_use_internal_errors( true );
    $document->loadHTML( $scraped_website );
    libxml_use_internal_errors( false );

    $selector = new DOMXPath( $document );
    $anchors = $selector->query( '/html/body//script' );

    $images = array();
    $insta_feed = array();

    // Only the script that assigns window._sharedData holds the profile JSON,
    // so skip every other script instead of decoding an empty match
    foreach ( $anchors as $a ) {
        $text = $a->nodeValue;
        if ( !preg_match( '/window\._sharedData\s*=\s*(\{.*\});/s', $text, $matches ) ) {
            continue;
        }
        $json = json_decode( $matches[1], true );
        if ( isset( $json['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'] ) ) {
            $images = $json['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'];
        }
        break;
    }

    // Iterate over the media edges themselves, not over the number of script tags
    foreach ( $images as $image ) {
        $node = $image['node'];
        $insta_feed[] = array(
            'thumbnail' => $node['thumbnail_resources'][0]['src'],
            'small'     => $node['thumbnail_resources'][1]['src'],
            'medium'    => $node['thumbnail_resources'][2]['src'],
            'large'     => $node['thumbnail_resources'][4]['src'],
            'original'  => $node['display_url'],
            // trailingslashit() is a WordPress helper
            'link'      => trailingslashit( '//instagram.com/p/' . $node['shortcode'] ),
            // A post may have no caption at all
            'caption'   => $node['edge_media_to_caption']['edges'][0]['node']['text'] ?? ''
        );
    }

    if ( !empty( $insta_feed ) ) {
        return ( $number ) ? array_slice( $insta_feed, 0, $number ) : $insta_feed;
    }
}
When it works, each $a in the foreach is a DOMElement Object that I can walk through to get the image URLs.
When it does not work, my $a looks like this:
DOMElement Object ( [tagName] => script [schemaTypeInfo] => [nodeName] => script [nodeValue] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); [nodeType] => 1 [parentNode] => (object value omitted) [childNodes] => (object value omitted) [firstChild] => (object value omitted) [lastChild] => (object value omitted) [previousSibling] => (object value omitted) [nextSibling] => [attributes] => (object value omitted) [ownerDocument] => (object value omitted) [namespaceURI] => [prefix] => [localName] => script [baseURI] => 
[textContent] => (function(){ function normalizeError(err) { var errorInfo = err.error || {}; var getConfigProp = function(propName, defaultValueIfNotTruthy) { var propValue = window._sharedData && window._sharedData[propName]; return propValue ? propValue : defaultValueIfNotTruthy; }; return { line: err.line || errorInfo.message || 0, column: err.column || 0, name: 'InitError', message: err.message || errorInfo.message || '', script: errorInfo.script || '', stack: errorInfo.stackTrace || errorInfo.stack || '', timestamp: Date.now(), ref: window.location.href, deployment_stage: getConfigProp('deployment_stage', ''), is_canary: getConfigProp('is_canary', false), rollout_hash: getConfigProp('rollout_hash', ''), is_prerelease: window.__PRERELEASE__ || false, bundle_variant: getConfigProp('bundle_variant', null), request_url: err.url || window.location.href, response_status_code: errorInfo.statusCode || 0 } } window.addEventListener('load', function(){ if (window.__bufferedErrors && window.__bufferedErrors.length) { if (window.caches && window.caches.keys && window.caches.delete) { window.caches.keys().then(function(keys) { keys.forEach(function(key) { window.caches.delete(key) }) }) } window.__bufferedErrors.map(function(error) { return normalizeError(error) }).forEach(function(normalizedError) { var request = new XMLHttpRequest(); request.open('POST', '/client_error/', true); request.setRequestHeader('Content-Type', 'application/json; charset=utf-8'); request.send(JSON.stringify(normalizedError)); }) } }) }()); )
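To tell the two cases apart, a small diagnostic along the following lines might help. It is only a sketch: the helper name `describe_shared_data` and the two sample pages are hypothetical, and it assumes the profile JSON is assigned to `window._sharedData` inside a single script tag, as in the working case. It reports whether such a script exists and which page types appear under `entry_data`.

```php
<?php
// Hypothetical diagnostic helper; not part of the scraper itself.
// Reports whether any <script> assigns window._sharedData and, if so,
// which keys appear under 'entry_data' in the decoded JSON.
function describe_shared_data(string $html): array
{
    $result = array('found' => false, 'valid_json' => false, 'entry_keys' => array());
    if ($html === '') {
        return $result; // nothing was fetched at all
    }

    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence HTML5 parse warnings
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($document->getElementsByTagName('script') as $script) {
        // Greedy match within one script, so a "};" inside the JSON
        // does not cut the capture short.
        if (!preg_match('/window\._sharedData\s*=\s*(\{.*\});/s', $script->textContent, $matches)) {
            continue;
        }
        $result['found'] = true;

        $json = json_decode($matches[1], true);
        if (is_array($json)) {
            $result['valid_json'] = true;
            if (isset($json['entry_data']) && is_array($json['entry_data'])) {
                $result['entry_keys'] = array_keys($json['entry_data']);
            }
        }
        break;
    }
    return $result;
}

// Synthetic pages, for illustration only.
$working = '<script>window._sharedData = {"entry_data":{"ProfilePage":[{}]}};</script>';
$failing = '<script>(function(){ var v = window._sharedData && window._sharedData["x"]; }());</script>';

print_r(describe_shared_data($working));
print_r(describe_shared_data($failing));
```

If `found` is false for a page fetched on the live server, the response did not contain the profile JSON in the first place, which would point at the request rather than at the parsing.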
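I really don't understand. From localhost it always works like a charm. On a live website, for some accounts, I get a DOMElement Object, but without any useful data.
This mostly happens with "verified" accounts.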
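Since the same code behaves differently on localhost and on the live server, it may be worth comparing what each environment actually received. The following is a minimal sketch under that assumption; the helper name `summarize_page` and the sample markup are hypothetical. It reports the size of the response, the document title, and the number of script tags, which could be logged on both machines for the same account.

```php
<?php
// Hypothetical helper for comparing fetched pages across environments.
function summarize_page(string $html): array
{
    $summary = array('bytes' => strlen($html), 'title' => null, 'scripts' => 0);
    if ($html === '') {
        return $summary; // empty response: nothing to parse
    }

    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence HTML5 parse warnings
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $document->getElementsByTagName('title');
    if ($titles->length > 0) {
        $summary['title'] = trim($titles->item(0)->textContent);
    }
    $summary['scripts'] = $document->getElementsByTagName('script')->length;

    return $summary;
}

// Synthetic markup, for illustration only.
$sample = '<html><head><title> Example profile </title></head>'
        . '<body><script>var a = 1;</script></body></html>';

print_r(summarize_page($sample));
```

A different title or a much smaller response on the live server would suggest that the two environments are being served different pages for the same URL.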
Can anyone help me with this little challenge?