Instagram scraping via endpoints now requires authentication for all requests

Time: 2018-05-04 16:10:19

Tags: python-3.x web-scraping md5 instagram-api hashlib

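As you may know, Instagram announced changes to its endpoint APIs this month. It looks like, following Cambridge Analytica, Instagram changed its endpoint format and now requires a logged-in user session for all requests.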

I'm not sure which endpoints need updating, but I specifically use the media/comments endpoints, which now look like this:

Media OLD:

https://www.instagram.com/graphql/query/?query_id=17888483320059182&id={0}&first=100&after={1}

Media NEW:

https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%2221575514%22%2C%22first%22%3A12%2C%22after%22%3A%22AQAHXuz1DPmI3FFLOzy5iKEhHOLKw3lt_ozVR40TphSdns0Vp5j_ZEU6Qj0CW6IqNtVGO5pmLCQoX0Y8RVS9aRTT2lWPp6vf8vFqjo1QfxRYmA%22%7D
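
For readers comparing the two formats, here is a minimal sketch that decodes the percent-encoded `variables` parameter from the new URL above and rebuilds it from plain values. It shows that the old `id`, `first` and `after` query parameters are now carried as one URL-encoded JSON object. The helper name `encode_variables` is hypothetical and only for illustration.

```python
import json
import urllib.parse

# The percent-encoded `variables` value copied from the new media URL above.
encoded = (
    "%7B%22id%22%3A%2221575514%22%2C%22first%22%3A12%2C%22after%22%3A%22"
    "AQAHXuz1DPmI3FFLOzy5iKEhHOLKw3lt_ozVR40TphSdns0Vp5j_ZEU6Qj0CW6IqNtVGO5pm"
    "LCQoX0Y8RVS9aRTT2lWPp6vf8vFqjo1QfxRYmA%22%7D"
)

# Undo the percent-encoding, then parse the JSON object it contains.
decoded = json.loads(urllib.parse.unquote(encoded))
print(decoded["id"], decoded["first"])  # prints: 21575514 12


def encode_variables(user_id, first, after):
    """Hypothetical helper: serialize to compact JSON, then percent-encode it."""
    raw = json.dumps(
        {"id": user_id, "first": first, "after": after},
        separators=(",", ":"),  # no spaces, matching the URL above
    )
    return urllib.parse.quote(raw, safe="")


rebuilt = encode_variables(decoded["id"], decoded["first"], decoded["after"])
print(rebuilt == encoded)
```

The compact separators matter if the same string is later hashed, because any whitespace difference would change the digest.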

The script I have been using to work around this is as follows:

#!/usr/bin/env python3
import requests
import urllib.parse
import hashlib
import json

#CHROME_UA = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'
CHROME_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'

def getSession_old(rhx_gis, csrf_token, variables):
    """ Get session preconfigured with required headers & cookies. """
    # Signature input: "rhx_gis:csrf_token:user_agent:variables"
    print(variables)
    values = "%s:%s:%s:%s" % (
            rhx_gis,
            csrf_token,
            CHROME_UA,
            variables)
    x_instagram_gis = hashlib.md5(values.encode()).hexdigest()

    session = requests.Session()
    session.headers = {
            'user-agent': CHROME_UA,
            'x-instagram-gis': x_instagram_gis
            }
    print(x_instagram_gis)
    session.cookies.set('ig_pr', '2')
    session.cookies.set('csrftoken', csrf_token)

    return session


def getSession(rhx_gis, variables):
    """ Get session preconfigured with required headers & cookies. """
    # Signature input: "rhx_gis:variables"
    values = "%s:%s" % (
            rhx_gis,
            variables)
    x_instagram_gis = hashlib.md5(values.encode()).hexdigest()

    session = requests.Session()
    session.headers = {
            'x-instagram-gis': x_instagram_gis
            }

    return session


if __name__ == '__main__':
    session = requests.Session()
    session.headers = { 'user-agent': CHROME_UA }
    response = session.get("https://www.instagram.com/selenagomez")
    data = json.loads(response.text.split("window._sharedData = ")[1].split(";</script>")[0])
    csrf = data['config']['csrf_token']
    rhx_gis = data['rhx_gis']
    variables = '{"id":"460563723","first":10,"after":"AQBf8puhlt8nU2JzmYdMMTuH0FbMgUM1fnIOZIH7n94DM4VLWkVILUAKVB-5dqvxQEI-Wd0ttlEDzimaaqwC98jccQaDQT4tSF56c_NlWi_shg"}'
    session = getSession(rhx_gis, variables)

    query_hash = '42323d64886122307be10013ad2dcc44'
    encoded_vars = urllib.parse.quote(variables, safe='"')
    url = 'https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s' % (query_hash, encoded_vars)
    print(url)
    print(session.get(url).text)

I'm sure this script worked fine until 11 days ago, but it no longer works. Does anyone know of a way to fetch a user's posts without authentication?
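
To make the signing step easier to check in isolation, here is a minimal sketch that separates it from the network calls. It computes the `x-instagram-gis` value the same way `getSession` does in the script (MD5 of `rhx_gis:variables`) and returns the URL and headers without sending anything. The names `sign_variables` and `build_request` are hypothetical, the `rhx_gis` value and user agent are placeholders, and `variables` is shortened for readability. Passing the user agent through is an unverified assumption, not a confirmed fix: the script sends `CHROME_UA` on the profile request but not on the session returned by `getSession`.

```python
import hashlib
import urllib.parse


def sign_variables(rhx_gis, variables):
    """MD5 hex digest of "rhx_gis:variables", as in getSession() in the script."""
    return hashlib.md5(f"{rhx_gis}:{variables}".encode()).hexdigest()


def build_request(rhx_gis, variables, query_hash, user_agent):
    """Hypothetical helper: return (url, headers) without touching the network."""
    query = urllib.parse.urlencode(
        {"query_hash": query_hash, "variables": variables}
    )
    url = "https://www.instagram.com/graphql/query/?" + query
    headers = {
        # Assumption: reuse the user agent sent when rhx_gis was fetched.
        "user-agent": user_agent,
        "x-instagram-gis": sign_variables(rhx_gis, variables),
    }
    return url, headers


# Placeholder inputs; a real rhx_gis would come from window._sharedData.
variables = '{"id":"460563723","first":10}'
url, headers = build_request(
    "dummy-rhx-gis",
    variables,
    "42323d64886122307be10013ad2dcc44",
    "ExampleAgent/1.0",
)
print(url)
print(headers["x-instagram-gis"])
```

Because the signature covers the exact `variables` string, the string that is hashed and the string that is URL-encoded need to be byte-for-byte identical.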

0 answers